EP118 What are you Searching For - PowerPoint PPT Presentation


Provided by: frank171


1
EP118: What Are You Searching For?
Dmitry Chernizer, Enterprise Systems Architect
Dcherniz_at_sybase.com
2
Agenda

In My Generation
Concept Search Module
Content Search Module
Sample Configurations
Summary

3
The e-Volution of Unstructured Data

Back in my day, people used to walk to their
information.
Up Hill Both Ways!
Now everything is at your fingertips.
www.Research.com
www.StockAdvice.com
www.NewJob.com
Billions of pages of text, HTML, XML
Most of them useless and out-dated information
It's not about the 20 docs you have
It's about the 5 pages you need!

4
The e-Volution of Unstructured Data: How did we get here?
[Diagram labels: Hierarchical, Relational, Text, Word, PDF, HTML, "Crash!"]
5
The e-Volution of Unstructured Data: What is knowledge management?

A way to store non-relational (maybe hierarchical) data
A standard way to express complex relationships
Gather data
Process and store it
Query and display it
Ability to assign a life-cycle to a piece of information
Why, you ask?
Because your brain works that way

6
What you need to know

Less than Einstein
More than this guy...
Two kinds of Search Engines
Concept Based Search
Content Based Search

7
What you need to know
Concept Based Search: Deals with processing unstructured requests
Content Based Search: Deals with processing structured requests
8
Concept Based Search: Why and where does it fit in?
Continuous Availability
Personalization
Content Management
Integration
Security
9
Concept Search Engine Basics
Note: The Sybase Concept Search Engine uses embedded technology from Autonomy

The Purpose
The Process
Bayesian Inference
Shannon's Information Theory
Adaptive Probabilistic Concept Modeling
Dynamic Reasoning Engine
Examples

10
Concept Search Engine Basics: The Purpose

Automate process of getting the right information
to the right person
Improve the efficiency of information retrieval
Enable the dynamic personalization of digital
content.
Natural language content search and retrieval
Automatic categorization by an agent
Automatic Content Personalization

11
Concept Search Engine Basics: The Process

Advanced Concept matching techniques
High-performance pattern matching algorithms
Can analyze a text and identify the key concepts
within the document
Based on frequency and relationships of terms
correlated with meaning
Language Independent

12
Concept Search Engine Basics: Limitation of other approaches

Keyword, Boolean and Proximity Searches
Exacerbate / increase information overload
Can't tell how relevant a document is to the subject being researched
Only track simple occurrence of keywords
(e.g., "CD AND (NOT (financial OR money OR invest)) AND music")
May track proximity of content but not relevant content
Lack of localization (English Wizard, hey I'm NOT!)
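
To make the limitation concrete, here is a minimal Python sketch of a Boolean keyword filter that evaluates the query from the slide. The sample documents and the `tokens` and `matches` helpers are invented for illustration and do not represent any Sybase, Autonomy, or Verity component.

```python
import re

def tokens(text):
    """Lower-cased set of words in a document (simple occurrence only)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(text):
    """Evaluate: CD AND (NOT (financial OR money OR invest)) AND music."""
    words = tokens(text)
    excluded = {"financial", "money", "invest"}
    return "cd" in words and "music" in words and not (words & excluded)

# Hypothetical documents, made up for this example.
docs = {
    "review": "A detailed review of the new jazz CD and the music on it.",
    "aside": "I left a CD in the car. Unrelated: the music store closed.",
    "bank": "Invest your money in a CD, says the music teacher.",
}

results = {name: matches(text) for name, text in docs.items()}
```

Both "review" and "aside" pass the filter, and the result is only true or false, so nothing in it says which of the two is more relevant to the subject being researched.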

13
Bayesian Inference

Developed by Thomas Bayes, 18th century cleric
and mathematician
Central tenet of modern statistical probability
modeling
Calculates probabilistic relationship between
multiple variables and the extent to which each
impacts the other
Used in pattern and fingerprint recognition

Okay maybe not this guy
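
As a rough illustration of the idea, the sketch below applies Bayes' rule to a single piece of evidence. The function name, the scenario, and all of the numbers are assumptions chosen for the example; the slides do not say how the Concept Search Engine applies Bayesian inference internally.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) by Bayes' rule for a binary hypothesis H and evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 10% of documents are about music (H).
# The word "album" (E) appears in 60% of music documents and 5% of the rest.
p = posterior(prior=0.10, likelihood=0.60, false_positive_rate=0.05)
```

With these made-up numbers, seeing the word raises the probability that a document is about music from 0.10 to roughly 0.57.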
14
Shannon's Information Theory

Developed by Claude Shannon in 1949
Words which are less frequent across all
documents, but appear in a cluster of documents
are more distinguishing and tend to convey more
information
Ideas can be inferred from related content
An inference engine may be used to parse and
build content
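
One common way to put a number on "less frequent words convey more information" is self-information, log2(1/p), where p is the share of documents containing the word. The sketch below is a generic illustration under that assumption; the function name and sample documents are hypothetical, and the slides do not state that the engine uses this exact measure.

```python
import math

def information_bits(term, documents):
    """Self-information log2(1/p), where p is the share of documents with term."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log2(len(documents) / containing)

# Hypothetical four-document collection.
documents = [
    "the band released the album",
    "the album tops the chart",
    "the bank raised the rate",
    "the market fell on the news",
]

common = information_bits("the", documents)    # appears in every document
rare = information_bits("album", documents)    # appears in a cluster of two
```

A word found in every document scores 0 bits and cannot distinguish anything, while the word confined to two of the four documents scores 1 bit.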

15
Adaptive, Probabilistic Concept Modeling
Bayesian Inference
Shannon's Information Theory
Dynamic Reasoning Engine (DRE) generates networks of concepts
Terms are weighted and relationships are established
The unstructured content portal metaphor
16
Concept Search Inference Engine: Also known as Dynamic Reasoning Engine

Core Engine of the Concept Search Logic
Uses the APCM algorithms to extract content
Generates relative weight of document relevance,
base summary and/or result set (non-tabular)
Generates query plans for unstructured data
May be stored as Templates for reusable queries
May be used by agent processes for aggregation
Accessed through the Enterprise Portal Search API

17
Auto Indexer

Automatically gathers text content from local
file systems and imports external files into an
index
Can gather document sets in a local file system
Can spider mapped drives
Can load a single document as discrete sets
Uses Verity, Keyview and Adobe filters to work on ASCII text
Will continually check for new content

18
Agent Process

Automated Content Categorizer
Stores categories or reusable queries known as
Agents
Agents can be shared or used to find people with
similar interests

19
Knowledge Fetch Process

Allows auto-spidering of web sites to gather data
Converts web content to indexable format
May be used to Fetch content from many sites
simultaneously
Can return meta-data and conventional text
content
Obtains Web Pages behind Firewalls and through
Proxy Servers
Obtains Web Pages protected by a login
Obtains Web Pages using Cookies

20
The Knowledge Management Server: A Portal Service
[Diagram labels: Sybase Enterprise Portal, Application Service Engine, Open Client, IBM DRDA, SQLNet, ODBC/JDBC, File I/O, POP3, Exchange, Lotus Notes, HTTP, HTTPS, Word]
21
Enterprise Portal Search Services

Encapsulate Search API into a set of EP
components
Components can be accessed by other EP services,
such as security servlets, messaging or other
EJBs
Allows load balancing across server clusters
Secure Search and Profile Locking
Allows extending the Dynamic Reasoning Engine via ANY component model (Java, C, ActiveX, server-side JavaScript, etc.)

22
Sample Architectures
[Architecture diagram labels: Client, Firewall, Load Balancing Hardware, Web Server Presentation Layer, Application Engine, Concept Search Inference Engine, Knowledge Server, Knowledge Server Agents, External Spider Agent, Internal Spider Agents, Fetch Agent (x2), Data repository, Unstructured Data Repositories, Intranet, DMZ Ring]
23
Storage Overhead
No content stored, just terms and weights: 30 - 50% of original document size
Content stored, plus terms and weights: 150% of original document size
Content, proximity phrase matching, and terms and weights: 250% of original document size
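
A quick sizing sketch, reading the slide's figures as percentages of the original document size. The dictionary keys, the function name, and the 40 GB corpus are assumptions for illustration only.

```python
# Overhead ratios from the slide, as (low, high) fractions of source size.
OVERHEAD = {
    "terms_only": (0.30, 0.50),
    "content_and_terms": (1.50, 1.50),
    "content_proximity_and_terms": (2.50, 2.50),
}

def index_size_gb(source_gb, mode):
    """Return the (low, high) estimated index size in GB for a source corpus."""
    low, high = OVERHEAD[mode]
    return (source_gb * low, source_gb * high)

# Hypothetical 40 GB document corpus.
estimates = {mode: index_size_gb(40, mode) for mode in OVERHEAD}
```

For the hypothetical 40 GB corpus this gives about 12 to 20 GB for terms and weights only, 60 GB when content is stored as well, and 100 GB with proximity phrase matching.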
24
Content Based Search: Why and where does it fit in?
Continuous Availability
Personalization
Content Management
Integration
Security
25
Content Full Text Search Engine Basics
Note: The Sybase Content Full Text Search Engine uses embedded technology from Verity

The Purpose
The Process
Content Search Basics
Full Text Search Specialty Data Store
Sample Architectures

26
Content Search Engine Basics: The Purpose

Structured (SQL) Access to Unstructured Data
Adaptive Server (or EP) indexes documents stored
in external data stores
Indexes are maintained within a collection
It understands words and language constructs
It understands many document types e.g. MS Word,
html, sgml, pdf, etc

27
Content Search Engine Basics: The Process
Sybase Enterprise Portal
Application Service Engine
SQL Query
Specialty Data Store
Text
Word
28
Content Search Engine Basics: The Process

Queries are issued against a collection
Results include a document identifier and a score
Score indicates how well a document matched the
query
Can understand and index many foreign languages
Includes rules for understanding words and constructs of the specified language

29
Content Search Engine Basics: The Process

Queries are issued against a collection
Results include a document identifier and a score
Score indicates how well a document matched the
query

Collection - A
Find documents where blue is near red
ID 68, score 98
ID 17, score 71
30
Content Search Engine Basics: The Process

Can understand and index many foreign languages
Includes rules for understanding words and constructs of the specified language

Hola! Bon Jiorno! Mahalo! Kem-Cho!
31
Content Search Engine Basics: The Process
Indexed data and index in two separate data stores
Indexed Data
Indexed Data
Updates, synchronization, backup, recovery?
32
Full Text Search is a Specialty Data Store: Yes, but...

Data Store propagates source changes to the
collection
An events table (text_events) is used to log
changes to the source tables
Data Store must be notified that changes exist
Backups of both data stores must be synchronized
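
The propagation pattern described above can be sketched with a toy in-memory model. Only the name `text_events` comes from the slide; the `TextStore` class, its methods, and the event format are hypothetical and are not the Sybase implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TextStore:
    """Toy stand-in for a source table, its events log, and a text collection."""
    source: dict = field(default_factory=dict)       # id -> document text
    text_events: list = field(default_factory=list)  # pending (operation, id)
    collection: dict = field(default_factory=dict)   # id -> set of indexed words

    def upsert(self, doc_id, text):
        """Change the source table and log the change."""
        self.source[doc_id] = text
        self.text_events.append(("upsert", doc_id))

    def delete(self, doc_id):
        """Remove a source row and log the change."""
        del self.source[doc_id]
        self.text_events.append(("delete", doc_id))

    def sync(self):
        """Apply logged changes to the collection; return events processed."""
        processed = len(self.text_events)
        for operation, doc_id in self.text_events:
            if operation == "upsert" and doc_id in self.source:
                self.collection[doc_id] = set(self.source[doc_id].lower().split())
            else:
                # Deleted rows (or rows deleted after an upsert) leave the index.
                self.collection.pop(doc_id, None)
        self.text_events.clear()
        return processed

store = TextStore()
store.upsert(1, "blue sky")
store.upsert(2, "red sun")
store.delete(2)
stale = dict(store.collection)  # nothing is indexed until sync() is called
count = store.sync()
```

Until `sync()` runs, the collection lags behind the source, which is why the data store has to be told that changes exist and why backups of the two stores need to be taken at a consistent point.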

33
Full Text Search is a Specialty Data Store: Sybase Provides
Integrated backup and restore facility
Backup / Recover database and text indexes
Online configuration
Configure Full Text Search at runtime
dump database...
34
Enhanced Full Text Search Features
Clustering: a feature for grouping similar documents
Clusters are inherently fuzzy - the algorithm merely attempts to group similar documents
Query By Example: provides ability to search for documents that are similar to one or more segments of text
select summary, score, copy
from t1 t, vt1 v
where t.id = v.id
and index_any = "<like> (Space the final frontier)"
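
To show what "similar to a segment of text" can mean, the sketch below ranks documents by cosine similarity of word counts against an example passage. This is a generic bag-of-words technique chosen for illustration; the slides do not describe how the engine's query-by-example operator scores similarity, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

example = "space the final frontier"

# Hypothetical documents, made up for this example.
documents = {
    "d1": "space is the final frontier for explorers",
    "d2": "the final score was close",
    "d3": "tax forms are due friday",
}

# Most similar document first.
ranked = sorted(
    documents,
    key=lambda doc_id: cosine_similarity(example, documents[doc_id]),
    reverse=True,
)
```

Here `ranked` puts d1 first because it shares the most words with the example, d2 second, and d3 last with a similarity of zero.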
35
Enhanced Full Text Search Features

Custom Thesaurus: allows users to build a thesaurus specific to their application.
Synonym Maps for proximity search
control: 1
synonyms: (
list: 'red, ruby, scarlet, fuchsia, magenta'
list: 'blue <or> azure'
)
A Text Index is used by joining the source table and the index table
select score, copy
from story_index i, stories s
where i.id = s.id
and i.score > 70
and i.index_any = "Digital <near> Compaq"

36
Summary

Sybase provides 2 types of Knowledge Management
Concept Search
Content Search
Technology Futures include an unstructured data
server, XML search and indexing, XSL translation
and other ways of managing hybrid data.
