EP118 What are you Searching For - PowerPoint PPT Presentation

Slides: 38
Provided by: frank171

Transcript and Presenter's Notes

1
EP118: What Are You Searching For?
Dmitry Chernizer, Enterprise Systems Architect
Dcherniz_at_sybase.com
2
Agenda
  • In My Generation
  • Concept Search Module
  • Content Search Module
  • Sample Configurations
  • Summary

3
The e-Volution of Unstructured Data
  • Back in my day, people used to walk to their
    information.
  • Uphill both ways!
  • Now everything is at your fingertips.
  • www.Research.com
  • www.StockAdvice.com
  • www.NewJob.com
  • Billions of pages of text, HTML, XML
  • Most of them useless and out-dated information
  • It's not about the 20 docs you have
  • It's about the 5 pages you need!

4
The e-Volution of Unstructured Data: How did we get here?
[Diagram labels: Word, Crash!, Text, Relational, Hierarchical, PDF, HTML]
5
The e-Volution of Unstructured Data: What is knowledge management?
  • A way to store non-relational (maybe
    hierarchical) data
  • A standard way to express complex relationships
  • Gather data
  • Process and store it
  • Query and display it
  • Ability to assign a life-cycle to a piece of
    information
  • Why, you ask?
  • Because your brain works that way

6
What you need to know
  • Less than Einstein
  • More than this guy...
  • Two kinds of Search Engines
  • Concept Based Search
  • Content Based Search

7
What you need to know
  • Concept Based Search: deals with processing
    unstructured requests
  • Content Based Search: deals with processing
    structured requests
8
Concept Based Search: Why and where does it fit in?
Continuous Availability
Personalization
Content Management
Integration
Security
9
Concept Search Engine Basics
Note: The Sybase Concept Search Engine uses
embedded technology from Autonomy
  • The Purpose
  • The Process
  • Bayesian Inference
  • Shannon's Information Theory
  • Adaptive Probabilistic Concept Modeling
  • Dynamic Reasoning Engine
  • Examples

10
Concept Search Engine Basics: The Purpose
  • Automate the process of getting the right
    information to the right person
  • Improve the efficiency of information retrieval
  • Enable the dynamic personalization of digital
    content.
  • Natural language content search and retrieval
  • Automatic categorization by an agent
  • Automatic Content Personalization

11
Concept Search Engine Basics: The Process
  • Advanced Concept matching techniques
  • High-performance pattern matching algorithms
  • Can analyze a text and identify the key concepts
    within the document
  • Based on frequency and relationships of terms
    correlated with meaning
  • Language Independent

12
Concept Search Engine Basics: Limitations of
Other Approaches
  • Keyword, Boolean and Proximity Searches
  • Exacerbate / increase information overload
  • Can't tell how relevant a document is to the
    subject being researched
  • Only track simple occurrence of keywords
  • (e.g., "CD AND (NOT (financial OR money OR
    invest)) AND music")
  • May track proximity of content but not relevant
    content
  • Lack of localization (English Wizard: hey, I'm
    NOT!)
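
To make the limitation concrete, here is a minimal Python sketch of a Boolean keyword filter for the example query on this slide. The function names and sample documents are invented for illustration, and this is not how any Sybase, Autonomy, or Verity component works internally. The point is that each document gets only a yes/no answer, with no measure of relevance.

```python
import re

def tokens(text):
    """Lowercase a text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_match(text):
    """Evaluate: CD AND (NOT (financial OR money OR invest)) AND music."""
    words = tokens(text)
    excluded = {"financial", "money", "invest"}
    return "cd" in words and "music" in words and not (words & excluded)

# Hypothetical documents: one clearly on topic, one barely, one off topic.
docs = {
    "review": "This CD is full of great music.",
    "passing": "A long essay on shipping that mentions a music CD once.",
    "bank": "Invest your money in a CD; listen to music while you wait.",
}

# Boolean matching yields a yes/no answer per document, not a ranking.
hits = [name for name, text in docs.items() if boolean_match(text)]
```

Both the dedicated review and the essay that mentions a music CD in passing come back as equal hits, which is the overload problem the slide describes.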

13
Bayesian Inference
  • Developed by Thomas Bayes, 18th century cleric
    and mathematician
  • Central tenet of modern statistical probability
    modeling
  • Calculates probabilistic relationship between
    multiple variables and the extent to which each
    impacts the other
  • Used in pattern and fingerprint recognition
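
As a small worked example of the idea, the following Python sketch applies Bayes' rule to estimate how likely a document is to be about a topic once a given term has been seen. All probabilities are made-up numbers chosen for illustration, and the sketch is not a description of the engine's actual model.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# Hypothetical numbers: 10% of documents are about music (H);
# the term "album" (E) appears in 60% of music documents
# and in 5% of all other documents.
p_music = 0.10
p_term_given_music = 0.60
p_term_given_other = 0.05

# Total probability of seeing the term in any document.
p_term = p_term_given_music * p_music + p_term_given_other * (1 - p_music)

# Probability that a document is about music, given that it contains the term.
p_music_given_term = posterior(p_music, p_term_given_music, p_term)
```

With these assumed inputs, seeing the term raises the estimate from the 10% prior to roughly 57%.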

Okay, maybe not this guy.
14
Shannon's Information Theory
  • Developed by Claude Shannon in 1949
  • Words which are less frequent across all
    documents, but appear in a cluster of documents
    are more distinguishing and tend to convey more
    information
  • Ideas can be inferred from related content
  • An inference engine may be used to parse and
    build content
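
The claim that rarer words carry more information can be sketched with a simple document-frequency measure. The Python below is a simplified illustration under stated assumptions: the function name and corpus are invented, and the engine's real weighting is not described in these slides.

```python
import math

def term_information(term, documents):
    """Information in bits of seeing `term` in a document: log2(N / df)."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        return None  # term never occurs, so no estimate is possible
    return math.log2(len(documents) / containing)

# Hypothetical four-document corpus.
corpus = [
    "the portal indexes the content",
    "the spider fetches the pages",
    "the engine ranks the content",
    "the bayesian engine weights terms",
]

common = term_information("the", corpus)      # appears in every document
rare = term_information("bayesian", corpus)   # appears in one of four
```

A word found in every document distinguishes nothing, while a word found in only one of four documents carries two bits under this measure.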

15
Adaptive, Probabilistic Concept Modeling
  • Bayesian Inference
  • Shannon's Information Theory
  • Dynamic Reasoning Engine (DRE) generates
    networks of concepts
  • Terms are weighted; relationships are established
  • The unstructured content portal metaphor
16
Concept Search Inference Engine: Also Known as
Dynamic Reasoning Engine
  • Core Engine of the Concept Search Logic
  • Uses the APCM algorithms to extract content
  • Generates relative weight of document relevance,
    base summary and/or result set (non-tabular)
  • Generates query plans for unstructured data
  • May be stored as Templates for reusable queries
  • May be used by agent processes for aggregation
  • Accessed through Enterprise Portal Search API

17
Auto Indexer
  • Automatically gathers text content from local
    file systems and imports external files into an
    index
  • Can gather document sets in a local file system
  • Can spider mapped drives
  • Can load a single document as discrete sets
  • Uses Verity, KeyView, and Adobe filters to
    convert documents to ASCII text
  • Will continually check for new content

18
Agent Process
  • Automated Content Categorizer
  • Stores categories or reusable queries known as
    Agents
  • Agents can be shared or used to find people with
    similar interests

19
Knowledge Fetch Process
  • Allows auto-spidering of web sites to gather
    data
  • Converts web content to indexable format
  • May be used to Fetch content from many sites
    simultaneously
  • Can return meta-data and conventional text
    content
  • Obtains Web Pages behind Firewalls and through
    Proxy Servers
  • Obtains Web Pages protected by a login
  • Obtains Web Pages using Cookies

20
The Knowledge Management Server: A Portal Service
Sybase Enterprise Portal
Open Client IBM DRDA SQLNet ODBC/JDBC
File I/O POP3 Exchange Lotus Notes
Application Service Engine
HTTP HTTPS
HTTP HTTPS
Word
21
Enterprise Portal Search Services
  • Encapsulate Search API into a set of EP
    components
  • Components can be accessed by other EP services,
    such as security servlets, messaging or other
    EJBs
  • Allows load balancing across server clusters
  • Secure Search and Profile Locking
  • Allows extending of the Dynamic Reasoning Engine
    via ANY component model (Java, C, ActiveX,
    Server-Side JavaScript, etc.)

22
Sample Architectures
Load Balancing Hardware
Firewall
Client
Web Server Presentation Layer
External Spider Agent
Concept Search Inference Engine
Application Engine
Knowledge Server Agents
Knowledge Server
Internal Spider Agents
Fetch Agent
Fetch Agent
Unstructured Data Repositories
Data repository
Intranet
DMZ Ring
23
Storage Overhead
  • No content stored, just terms and weights:
    30-50% of original document size
  • Content stored, plus terms and weights:
    150% of original document size
  • Content, proximity and phrase matching, and terms
    and weights: 250% of original document size
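
For capacity planning, the figures on this slide can be turned into a quick estimate. The Python sketch below assumes the numbers are percentages of the original document size; the mode names and the 40 GB corpus are hypothetical.

```python
# Overhead from the slide, read as percentages of the original size (assumption).
OVERHEAD_PERCENT = {
    "terms_only": (30, 50),           # no content stored, just terms and weights
    "content": (150, 150),            # content stored, plus terms and weights
    "content_proximity": (250, 250),  # content, proximity/phrase matching, terms and weights
}

def index_size_gb(source_gb, mode):
    """Return the (low, high) estimated index size in GB for a source corpus."""
    low, high = OVERHEAD_PERCENT[mode]
    return (source_gb * low / 100, source_gb * high / 100)

# Hypothetical 40 GB document set indexed without storing content.
estimate = index_size_gb(40, "terms_only")
```

Under that reading, a 40 GB document set needs between 12 and 20 GB for a terms-only index, and 100 GB if content and proximity data are stored as well.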
24
Content Based Search: Why and where does it fit in?
Continuous Availability
Personalization
Content Management
Integration
Security
25
Content Full Text Search Engine Basics
Note: The Sybase Content Full Text Search Engine
uses embedded technology from Verity
  • The Purpose
  • The Process
  • Content Search Basics
  • Full Text Search Specialty Data Store
  • Sample Architectures

26
Content Search Engine Basics: The Purpose
  • Structured (SQL) Access to Unstructured Data
  • Adaptive Server (or EP) indexes documents stored
    in external data stores
  • Indexes are maintained within a collection
  • It understands words and language constructs
  • It understands many document types (e.g., MS
    Word, HTML, SGML, PDF)

27
Content Search Engine Basics: The Process
Sybase Enterprise Portal
Application Service Engine
SQL Query
Specialty Data Store
Text
Word
28
Content Search Engine Basics: The Process
  • Queries are issued against a collection
  • Results include a document identifier and a score
  • Score indicates how well a document matched the
    query
  • Can understand and index many foreign languages
  • Include rules for understanding words and
    constructs of the specified language

29
Content Search Engine Basics: The Process
  • Queries are issued against a collection
  • Results include a document identifier and a score
  • Score indicates how well a document matched the
    query

Collection A
Query: Find documents where "blue" is near "red"
Results: ID 68, score 98; ID 17, score 71
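
The query-and-score flow on this slide can be imitated with a toy proximity scorer. This Python sketch is only an illustration: the scoring formula, the window size, and the sample texts are invented, so its scores will not match the ones shown on the slide or the ones a Verity collection would return.

```python
def near_score(text, a, b, window=5):
    """Return a 0-100 score for how closely words a and b occur (0 if too far apart)."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return 0
    gap = min(abs(i - j) for i in pos_a for j in pos_b)
    if gap > window:
        return 0
    # Adjacent words score 100; the score falls linearly as the gap widens.
    return round(100 * (window - gap + 1) / window)

# Hypothetical collection: document identifier -> text.
collection = {
    68: "a blue red flag",
    17: "blue sky over a red barn",
    5: "blue water",
}

scored = ((doc_id, near_score(text, "blue", "red")) for doc_id, text in collection.items())
results = sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: pair[1], reverse=True)
```

As on the slide, each result is a document identifier paired with a score, and documents that do not satisfy the query are left out.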
30
Content Search Engine Basics: The Process
  • Can understand and index many foreign languages
  • Include rules for understanding words and
    constructs of the specified language

Hola! Buon giorno! Mahalo! Kem-Cho!
31
Content Search Engine Basics: The Process
Indexed data and index in two separate data stores
Indexed Data
Indexed Data
Updates, synchronization, backup, recovery?
32
Full Text Search is a Specialty Data Store: Yes,
but...
  • Data Store propagates source changes to the
    collection
  • An events table (text_events) is used to log
    changes to the source tables
  • Data Store must be notified that changes exist
  • Backups of both data stores must be synchronized
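
The propagation steps above can be sketched in miniature. In this Python illustration, in-memory structures stand in for the source table, the events table, and the collection; only the name text_events comes from the slide, and everything else (function names, event format) is an assumption, not the product's actual mechanism.

```python
source = {}        # stands in for the source table: id -> text
text_events = []   # stands in for the events table: (operation, id)
index = {}         # stands in for the collection: word -> set of ids

def write(doc_id, text):
    """Insert or update a source row and log the change."""
    source[doc_id] = text
    text_events.append(("upsert", doc_id))

def delete(doc_id):
    """Delete a source row and log the change."""
    source.pop(doc_id, None)
    text_events.append(("delete", doc_id))

def propagate():
    """Apply logged changes to the index, then clear the log."""
    for _operation, doc_id in text_events:
        # Drop stale postings for this id, then re-index the row if it still exists.
        for ids in index.values():
            ids.discard(doc_id)
        for word in source.get(doc_id, "").lower().split():
            index.setdefault(word, set()).add(doc_id)
    text_events.clear()

write(1, "blue sky")
write(2, "red barn")
stale = dict(index)   # nothing is indexed until propagate() runs
propagate()
delete(2)
propagate()
```

The snapshot taken before the first propagate() is empty, which mirrors the slide's point that the data store must be told that changes exist before the collection reflects them.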

33
Full Text Search is a Specialty Data Store:
Sybase Provides
  • Integrated backup and restore facility
  • Backup / recover database and text indexes
  • Online configuration
  • Configure Full Text Search at runtime

dump database...
34
Enhanced Full Text Search Features
  • Clustering: a feature for grouping similar
    documents
  • Clusters are inherently fuzzy; the algorithm
    merely attempts to group similar documents
  • Query By Example: provides the ability to search
    for documents that are similar to one or more
    segments of text
  • select summary, score, copy from t1 t, vt1 v
    where t.id = v.id
    and index_any = "<like> (Space the final frontier)"
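
One common way to approximate Query By Example is to rank documents by cosine similarity of word counts against the example text. The Python sketch below uses that swapped-in technique purely for illustration; the slides do not say which similarity measure the engine uses, and the documents are invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: id -> text.
docs = {
    1: "space the final frontier these are the voyages",
    2: "the final exam is on friday",
    3: "gardening tips for spring",
}

example = "space the final frontier"
# Most similar document first.
ranked = sorted(docs, key=lambda doc_id: cosine(example, docs[doc_id]), reverse=True)
```

The same pairwise similarity could also feed a clustering step, which fits the slide's remark that clusters are inherently fuzzy groupings of similar documents.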
35
Enhanced Full Text Search Features
  • Custom Thesaurus: allows users to build a
    thesaurus specific to their application.
  • Synonym Maps for proximity search
  • control: 1 synonyms: (list: "red, ruby, scarlet,
    fuchsia, magenta"
    list: "blue <or> azure")
  • A Text Index is used by joining the source table
    and the index table
  • select score, copy from story_index i, stories s
    where i.id = s.id and i.score > 70
    and i.index_any = "Digital <near> Compaq"
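
To show what a synonym list buys at query time, here is a minimal Python sketch of synonym expansion using the two lists from this slide. The function names are hypothetical, and the real thesaurus is compiled and applied by the search engine rather than by code like this.

```python
# Synonym groups taken from the slide's example lists.
SYNONYMS = [
    {"red", "ruby", "scarlet", "fuchsia", "magenta"},
    {"blue", "azure"},
]

def expand(term):
    """Return the term plus every synonym listed alongside it."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return group
    return {term}

def synonym_match(text, term):
    """True if the text contains the term or any of its synonyms."""
    return bool(expand(term) & set(text.lower().split()))

found = synonym_match("an azure sky", "blue")
```

A search for "blue" now also finds a document that only says "azure", while terms outside the thesaurus match only themselves.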

36
Summary
  • Sybase provides 2 types of Knowledge Management
  • Concept Search
  • Content Search
  • Technology Futures include an unstructured data
    server, XML search and indexing, XSL translation
    and other ways of managing hybrid data.

37
Summary
  • Yes it can be done
  • Content, Concept: we have it all