Ontologies, Taxonomies and Search

1
Ontologies, Taxonomies and Search

Denise Bedford
Special Libraries Association
Baltimore, Maryland
June, 2006

2
Presentation Overview

Today I will talk from the end-game perspective of search, placing taxonomies and ontologies in that context
Search User Needs
Search Functional Architecture
Metadata and Categorization
Semantic Technologies

3
Search User Needs
4
Basic Assumptions About Search Systems

Search systems are not WYSIWYG: what you see is not what you get
Search systems are more like icebergs. Most of the search system is below the surface: you can't see it, and you don't know how it has been configured or what components it has
It is helpful to have a framework against which you can judge the suitability of any search tool to your environment
Assess the search environment before you decide
what kind of a search system you need

5
The Environment

Like other libraries, we acquire, create, license, and access information in 30 subject domains, ranging from Law and Justice, to Transport, to Environment, to Agriculture, to Education, to Health and Nutrition, etc.
We have 500 different kinds of content/document types
We have a set of business processes which represent the way we do our work
We have six working languages, but actually work in many more
We have a rich history, with content dating back to the 1940s
Our priorities change over time
Everyone in the Bank is a recognized
international expert

6
Underlying Search Architecture
Silos of information: multiple search engines and user interfaces
7
Search Environment

It is very challenging to design a search system
which meets the needs of this complex environment
Internal experts do known-item searching and look for other experts to talk to
The general public looks for information about something
Most people don't know all the dimensions of the Bank
People need to search in languages other than English and find content written in other languages
Our new Knowledge and Learning initiative is moving us towards a W3 working environment: what information I need, when I need it, where I need it
We need a search architecture that supports the users' needs

8
Information Architecture: Moving Towards Contextualization
3W Information Model: What I Want, Where I Want, When I Want
[Diagram: levels of access to content, ranging from structured to less structured]
Customized Access to Information
Individualized Access (Context Sensitive, Individual)
Group Access (Context Sensitive, Group)
Anonymous Access (Context Free)
Access points: myPortal, Client Connections, Operations Portal, Intranet, External Web
Content and Information Assets: Basic Information, Customer Information, People
9
Search Use Case Analysis

Why Search Wasn't Working

10
Search Has Always Been Problematic

Despite any reports you may read in the general
literature, buying a fast crawler will not solve
your search problems
Implementing a fast crawler simply surfaces
information management and data quality
challenges directly to the users
The crawler approach provides high hit rates, and
very low relevancy rates
It was a challenge to understand the search problem from the system side
Try to understand the search problem from the user's side

11
Use case: a consultant is looking for the CAS for the Congo.
[Diagram: search path for this use case. Steps 1-2 (F): DEC Consultant, Intranet Search. Steps 4-6 (F): ImageBank Search.]
12
Use case: an adviser wants to find social indicators, relevant quotes, and the Bank's position on racism, as well as learn what the Bank has done in this area, in order to prepare a 30-minute speech for a WB Managing Director for a conference.
13
What We Learned

The absolute success rate per search task was low, at 43.18%
The search experience generally consists of multiple steps and multiple searches within and across sources
The source with the fewest steps was a Colleague or Personal Contact. The source with the most steps was External Web Search and Browse
Each source has its own behavior, business rules, and functional architecture, so users need to learn each system
The Intranet search was selected as a logical place to look for information in over 65% of the use cases. However, the success rate for the Intranet Google search was only 35%

14
What We Learned

You need to understand search from the searcher's point of view
Until you can see it this way, you are simply
guessing at what problems need to be solved and
how to solve them
This is a challenge for an organization whose
users are the public, but it is possible to
design an architecture that suits the general
public and the internal staff, and the important
stakeholders
The critical success factor is to design an open
architecture for agility and flexibility

15
Enterprise Search

Evolving Search Architecture

16
Search Engine Parts
[Diagram: parts of a search engine, from inputs through matching to outputs]
Contextualization, Personalization, Recommender, Content Syndication, QA Systems, Intelligent Search
Search System Inputs; Query Transformation and Matching; Search Outputs
Query Manipulation: simple search; fielded search; query processing algorithms; search term assistance (thesaurus, dictionaries); search language selection; sources selection
Matching: Boolean matching; exact phrase matching; fuzzy matching; term matching with synonym expansion; term matching with cross-language expansion; term weighted matching; query term root matching with dictionary expansion; wild card matching; term proximity matching; neural network expansion; genetic algorithm expansion
Metadata Repository for Enterprise Search
Index Architecture and Construction
Display Results: integrated search results; search results sorting; search results contextualization; search results relevancy ranking
Vocabulary Support (thesaurus)
Classification Schemes (taxonomy)
Query Processing: transforming the query to add tags, synonyms, etc.
Layer labels: What the user sees; The basic search programs; Foundation or Baseline of the Search System
17
Enterprise Search Proposed Solution

What Enterprise Search is and what it is not
Distinguish between the web services platform for accessing Enterprise Search and the enterprise's full store of content (it's not only electronic, and it's not all in HTML format!)
What kind of architecture do you need to support enterprise search?
What kind of tools do you need to support enterprise search? It's much more than just a web crawler.
What do you need to support true concept-level, cross-language searching?
How does Enterprise Search respect security classifications with or without restricting knowledge of what exists?

18
(No Transcript)
19
Taxonomies and Ontologies in Search
20
Using Taxonomies in Search

Controlled vocabularies: guiding users through the search process (flat taxonomies)
Classification schemes: browsing, navigation, syndication, contextualization (hierarchical taxonomies)
Metadata: supporting fielded search (faceted taxonomies)
Thesauri: supporting knowledge discovery, preventing zero results, supporting cross-language searching (network taxonomies)
Synonym rings: improving relevancy, reducing information scattering, managing recall (ring taxonomies)

21
Vision of An Enterprise Advanced Search
Ring taxonomy
Ring taxonomy
Flat taxonomy
Hierarchical taxonomy
Fielded Search Faceted Taxonomy
22
Ring Taxonomy
Metadata
Network Taxonomy
23
Data Driving the Enterprise Architecture

30 Second Description of Taxonomies and Ontologies

24
Taxonomies and Ontologies

I just mentioned several kinds of taxonomies that support search
Faceted taxonomies (metadata schemes)
Flat taxonomies (controlled vocabularies, picklists)
Hierarchical taxonomies (classification schemes)
Network taxonomies (thesauri, semantic networks)
Ring taxonomies to handle synonym rings and synonym searching
Let's walk through these kinds of taxonomies
Then let's see how they all make up an ontology

25
Facet Taxonomies
A faceted taxonomy is represented as a star data structure. Each node in the star structure is linked to the center focus. Any node can be linked to other nodes in other stars. It appears simple, but becomes complex quickly.
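The star structure can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name FacetedRecord, the function fielded_search, and the sample titles and facet values are hypothetical and do not come from the presentation. Each record is the center of one star, its facet values are the surrounding nodes, and a value shared by two records is what links one star to another.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FacetedRecord:
    """Center of one 'star': a content item with facet values around it."""
    title: str
    facets: dict[str, set[str]] = field(default_factory=dict)

    def add(self, facet: str, value: str) -> None:
        # A facet may hold several values (for example two topics).
        self.facets.setdefault(facet, set()).add(value)


def fielded_search(records, **criteria):
    """Return the records whose facets contain every requested value."""
    return [
        record for record in records
        if all(value in record.facets.get(facet, set())
               for facet, value in criteria.items())
    ]


# Hypothetical sample data, for illustration only.
report = FacetedRecord("Rural roads appraisal")
report.add("topic", "Transport")
report.add("country", "Ghana")

memo = FacetedRecord("School feeding memo")
memo.add("topic", "Education")
memo.add("country", "Ghana")

hits = fielded_search([report, memo], country="Ghana", topic="Transport")
print([record.title for record in hits])
```

The two sample records share the country value, which is the kind of cross-star link that makes the structure grow complex as facets and records multiply.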
26
Core Metadata Strategy

What is a core metadata strategy?
What's the process you use to discover your organization's core metadata strategy?
Libraries have many core metadata standards: COSATI, Dublin Core, MARC, MODS, AIIM TR48
The important question for metadata in search is
how are you using your metadata to support
search?
Too often we put a dumb search engine on top of
smart metadata and do nothing more with it than
publish it
It's time to think smarter about how we use our metadata

27
Purpose of Core Metadata
Identification/ Distinction
Search and Browse
Compliant Document Management
Use Management
28
Capturing Core Metadata
Identification/ Distinction
Compliant Document Management
Search and Browse
Use Management
Human Creation
Programmatic Capture
Extrapolate from Business Rules
Inherit from System Context
29
Flat Taxonomy Structure
Energy Environment Education
Economics Transport Trade
Labor Agriculture
30
Hierarchical Taxonomy
A hierarchical taxonomy is represented as a tree
data structure in a database application. The
tree data structure consists of nodes and
links. In an RDBMS environment, the
relationships become associations. In a
hierarchical taxonomy, a node can have only one
parent.
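As a rough illustration of the single-parent rule, here is a minimal Python sketch of a tree of nodes and links. The class name TaxonomyNode and the sample labels are assumptions made for this example; it is not how the Bank's database represents its scheme.

```python
class TaxonomyNode:
    """A node in a hierarchical taxonomy: at most one parent, many children."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = None
        self.children = []
        if parent is not None:
            parent.add_child(self)

    def add_child(self, child):
        # Enforce the defining rule of a hierarchy: one parent per node.
        if child.parent is not None:
            raise ValueError(f"{child.label!r} already has a parent")
        child.parent = self
        self.children.append(child)

    def path(self):
        """Return the labels from the root down to this node."""
        node, labels = self, []
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return " > ".join(reversed(labels))


# Hypothetical labels, for illustration only.
root = TaxonomyNode("Topics")
transport = TaxonomyNode("Transport", parent=root)
roads = TaxonomyNode("Roads", parent=transport)
print(roads.path())
```

Trying to attach "Roads" under a second parent raises an error, which is exactly the constraint that separates a hierarchical taxonomy from a network taxonomy.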
31
Hierarchical Taxonomies: Classification Schemes

Ranganathan is the supreme authority on how to create a well-designed classification scheme
Most of what we teach in graduate school, though, is how to use a classification scheme, not how to create one
Classification schemes are the controlling
reference source for metadata attributes
For example, the Enterprise Topic Classification
Scheme is the reference source for the attribute
Topic
We use tools to help us discover what the scheme
should be and also to help us classify content

32
Network Taxonomies
A network taxonomy is a plex data structure. Each node can have more than one parent. Any item in a plex structure can be linked to any other item. In plex structures, links can be meaningfully different.
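A plex structure can be sketched as a graph with typed links. The example below is an assumption-laden illustration: the class NetworkTaxonomy and the sample terms are hypothetical, and the link types are the conventional thesaurus relationships BT (broader term), NT (narrower term) and RT (related term).

```python
from collections import defaultdict


class NetworkTaxonomy:
    """Terms connected by typed links; a term may have several parents."""

    # Each relationship is stored together with its reciprocal.
    INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}

    def __init__(self):
        self.links = defaultdict(set)  # term -> {(relation, other_term)}

    def relate(self, term, relation, other):
        self.links[term].add((relation, other))
        self.links[other].add((self.INVERSE[relation], term))

    def related(self, term, relation):
        return sorted(o for r, o in self.links[term] if r == relation)


# Hypothetical terms, for illustration only.
net = NetworkTaxonomy()
net.relate("Rural roads", "BT", "Transport")
net.relate("Rural roads", "BT", "Agriculture")   # a second parent is allowed
net.relate("Rural roads", "RT", "Market access")

print(net.related("Rural roads", "BT"))
print(net.related("Transport", "NT"))
```

Keeping the relation type on every link is what lets a search tool treat "broader" and "related" links differently, for example when expanding a query.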
33
Ring Taxonomy
Poverty mitigation
Poverty alleviation
Poverty reducation
Poverty elimination
Poverty prevention
Poverty reduction
Poverty abatement
Poverty eradication
Rings can include all kinds of synonyms: true synonyms, misspellings, predecessors, abbreviations
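One way a synonym ring could be used at query time is sketched below in Python. The function name expand_query and the list-of-rings layout are assumptions for illustration; the ring members are the terms from this slide, including the misspelling.

```python
# The ring from the slide; every member is treated as equivalent.
POVERTY_RING = frozenset({
    "poverty mitigation",
    "poverty alleviation",
    "poverty reducation",   # misspelling kept on purpose
    "poverty elimination",
    "poverty prevention",
    "poverty reduction",
    "poverty abatement",
    "poverty eradication",
})

RINGS = [POVERTY_RING]


def expand_query(term, rings=RINGS):
    """Return every member of the ring containing the term, or the term alone."""
    key = term.strip().lower()
    for ring in rings:
        if key in ring:
            return sorted(ring)
    return [key]


print(expand_query("Poverty Reduction"))
print(expand_query("Transport"))
```

Because a ring has no preferred term, a search for any member retrieves content indexed under all of them, which is how rings reduce information scattering.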
34
Ontology Design from 50,000 Feet
[Diagram: entity-relationship view of the ontology]
Entities: User, Profile, Skill Sets/Competencies, Context, Contextual Matrix Sensing, Business Rule, Content, Content Entity, Content Parts, Content Elements, Metadata, Thesaurus, Topic Class Scheme, Business Process Scheme, Country Names, Region Names, Standard Statistical Variables
Relationship labels: uses, Has, Has values, Has Meaning in, Has relationship to, Contains, Understood
Legend: Not in progress, In progress, In Use
35
Building and Maintaining Taxonomies

Moving Towards Automated Metadata Capture

36
Topic data class
3 Oracle Data classes
Subtopic Data Class
Relationships across data classes
37
Building and Maintaining Taxonomies

Moving towards automated metadata generation
means that catalogers shift their effort to
reviewing the metadata generated and to
maintaining LCSH and LC classification as part of
a suite of categorization tools
Level of effort shifts to training and developing
the tools and away from original cataloging and
metadata capture
Continue to work closely with subject experts to
define the controlled vocabularies and
classification schemes

38
The problem with metadata

Metadata:
Is expensive and time-consuming to create
Is sometimes subjective and not granular enough
Doesn't always address the ways that users and systems think about the information it describes
May not tell us enough about the information to trust it
May address only one context: the context for which it is created
May live in the source application where it was
created
May not be as accessible as the information
object
The future depends on metadata so we have to
resolve these problems
We are resolving the problem using automated
metadata creation and extraction technologies
Metadata extraction uses rules, classifiers and
grammar engines
Metadata creation uses categorization and rules
engines

39
Smart Use of Technologies

Sample structure: Bank Topics Classification Scheme (hierarchical taxonomy)
Oracle data classes are used to represent the Topic Classification Scheme
The hierarchical taxonomy is the reference source for the attribute Topic
It is used for Browse, Search, Content Syndication, Personalization
1st challenge is to architect the hierarchy
correctly
3 distinct data classes, not a tree structure
with inheritance
Allows you to use the three data classes for
distinct functions across systems but still
enforce relationships across the classes

40
What is Teragram?

Semantic analysis tools which support concept
extraction, categorization, summarization and
pattern matching rules engines
Teragram works in 23 languages
Use categorization to capture Topics, Business
Activities, Regions, Sectors, Themes, etc.
Use Concept Extraction to capture keywords
Use Rules Engine to capture Loan #, Credit #, Project ID, Trust Fund #, etc.
Use Summarization to generate a gist of the
content

41
How does semantic analysis work?
42
Automatically Generated Metadata
Metadata is generated using a categorization profile that reflects the way people organize.
43
Automatically Extracted Metadata
Proper Noun Profile for People Names uses
grammars to find and extract the names of people
referenced in the document.
<?xml version="1.0" encoding="UTF-8"?>
<Proper_Noun_Concept>
<Source><Source_Type>file</Source_Type>
<Source_Name>W/Concept Extraction/Media Monitoring Negative Training Set/001B950F2EE8D0B4452570B4003FF816.txt</Source_Name>
</Source>
<Profile_Name>PEOPLE_ORG</Profile_Name>
<keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
<keyword_count>7</keyword_count>
</Proper_Noun_Concept>
44
Semantic Analysis Basics

Once you have made some sense of the sentence,
reconstruct entities for information extraction
(compose)
Identify names and other fixed-form expressions: people, organizations, conferences
Identify basic noun groups, verb groups, prepositions, other grammatical elements
Construct complex noun groups and verb groups
Identify event structures
Identify common elements and associate

45
Categorizing Content

Let's look at how we're categorizing our content to this structure automatically
Topic classification, geographical region assignment, keywording examples
Can apply this approach to any kind of content
Enables us to build a robust metadata repository model, with strong metadata quality, to move towards semantic interoperability (SI) at the functional level
Also note that we can do this across many
languages

46
Leveraging the Structure

Each subtopic is a knowledge domain (hierarchical taxonomy) and each subtopic has an extensive concept-level definition (1,000 to 5,000 concepts)
Concepts are controlled vocabularies in their raw form (flat taxonomy)
Concepts with relationships (extensive per the new Z39.19 standard) comprise a semantic network (network taxonomy)
Categorization tools work with the topic structure and concept definitions to categorize and index content
The following screen illustrates how that same structure is embedded into a Teragram profile to support categorization
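To make the idea concrete, here is a deliberately simple Python sketch of concept-based categorization: each subtopic is defined by a set of concepts, and a document is assigned to the subtopics whose concepts it mentions often enough. This is not Teragram's algorithm or API. The function categorize, the min_hits threshold and the tiny concept lists are all assumptions for illustration; real definitions run to thousands of concepts per subtopic.

```python
import re

# Hypothetical, drastically shortened concept definitions per subtopic.
SUBTOPIC_CONCEPTS = {
    "Transport": {"road", "railway", "port", "freight"},
    "Education": {"school", "teacher", "curriculum", "literacy"},
}


def categorize(text, profiles=SUBTOPIC_CONCEPTS, min_hits=2):
    """Return subtopics with at least min_hits concept mentions, best first."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {}
    for subtopic, concepts in profiles.items():
        hits = sum(1 for word in words if word in concepts)
        if hits >= min_hits:
            scores[subtopic] = hits
    # Sort by descending score, then alphabetically for stable output.
    return sorted(scores, key=lambda s: (-scores[s], s))


doc = "The project rehabilitates a rural road and a railway link to the port."
print(categorize(doc))
```

A production categorizer would also handle multi-word concepts, stemming, weighting and other languages; the sketch only shows how a topic structure plus concept definitions can drive the assignment.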

47
Subtopics
Domain concepts or controlled vocabulary
48
Extensive operators allow us to write grammatical
rules to manage typical semantic problems
49
Concept based rules engine allows us to define
patterns to capture other kinds of data
50
Example of use of Authority Control to capture
country names but extract authorized version of
country name
Example of use of a gazetteer concept
extraction rules engine to support semantic
interoperability
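Authority control of this kind can be sketched as a lookup from every known variant of a name to its authorized form. The Python below is a minimal illustration under assumptions: the two sample entries and their variants are chosen for the example and are not the Bank's authority file, and extract_countries is a hypothetical helper name.

```python
import re

# Hypothetical authority file: authorized form -> known variants.
AUTHORITY = {
    "Côte d'Ivoire": ["Cote d'Ivoire", "Ivory Coast"],
    "Kyrgyz Republic": ["Kyrgyzstan", "Kirghizia"],
}

# Map every variant (and the authorized form itself) to the authorized form.
VARIANT_TO_AUTHORIZED = {
    variant.lower(): authorized
    for authorized, variants in AUTHORITY.items()
    for variant in [authorized, *variants]
}

# Try longer names first so a long variant wins over a shorter overlapping one.
_PATTERN = re.compile(
    r"\b("
    + "|".join(
        re.escape(v)
        for v in sorted(VARIANT_TO_AUTHORIZED, key=len, reverse=True)
    )
    + r")\b",
    re.IGNORECASE,
)


def extract_countries(text):
    """Find country mentions and return authorized names, without duplicates."""
    found = []
    for match in _PATTERN.finditer(text):
        authorized = VARIANT_TO_AUTHORIZED[match.group(1).lower()]
        if authorized not in found:
            found.append(authorized)
    return found


print(extract_countries("Reports on Ivory Coast and Kyrgyzstan; Cote d'Ivoire again."))
```

Whatever form of the name appears in the text, the metadata records only the authorized version, so systems that share the authority file agree on the value.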
51
Use of concept extraction rules engine to capture Loan #, Credit #, Project ID
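Pattern rules for identifiers such as loan numbers, credit numbers and project IDs can be approximated with regular expressions. The sketch below is in Python and is not the rules engine shown on the slide. The identifier formats in it (a "P" followed by six digits, and four- or five-digit loan and credit numbers) are assumptions invented for the example, as is the function name extract_identifiers.

```python
import re

# Hypothetical identifier formats, chosen only for illustration.
RULES = {
    "project_id": re.compile(r"\bP\d{6}\b"),
    "loan_number": re.compile(
        r"\bLoan\s+(?:No\.|Number|#)\s*(\d{4,5})\b", re.IGNORECASE
    ),
    "credit_number": re.compile(
        r"\bCredit\s+(?:No\.|Number|#)\s*(\d{4,5})\b", re.IGNORECASE
    ),
}


def extract_identifiers(text):
    """Apply each rule and return the captured values per rule name."""
    results = {}
    for name, pattern in RULES.items():
        # Use the capture group when the rule has one, else the whole match.
        results[name] = [
            m.group(m.lastindex or 0) for m in pattern.finditer(text)
        ]
    return results


sample = "Project P012345 is financed by Loan No. 4521 and Credit #30990."
print(extract_identifiers(sample))
```

Each rule pairs a contextual cue (the word "Loan" or "Credit") with a fixed-form value, which is the general shape of a pattern-matching rule even when the engine's own rule syntax differs.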
52
Enterprise Profile Creation and Maintenance

Enterprise Metadata Profile
Concept Extraction Technology
Country
Organization Name
People Name
Series Name/Collection Title
Author/Creator
Title
Publisher
Standard Statistical Variable
Version/Edition
Categorization Technology
Topic Categorization
Business Function Categorization
Region Categorization
Sector Categorization
Theme Categorization

UCM Service Requests
Update and Change Requests
Data Governance Process for Topics, Business Function, Country, Region, Keywords, People, Organizations, Project ID
e-CDS Reference Sources for Country, Region, Topics, Business Function, Keywords, Project ID, People, Organization
Enterprise Profile Development and Maintenance
[Diagram labels: TK240 Client; Teragram Team; Systems 1 to 5]
53
Enterprise Metadata Capture: Functional Reference Model
[Diagram labels]
People: Content Owners, Functional Team, Business Analyst, Indexers, Librarians
Dedicated Server: Teragram Semantic Engine (Concept Extraction, Categorization, Clustering, Rule Based Engine, Language Detection)
Enterprise Metadata Capture Strategy: TK240 Client, XML Output
Flows: Content Capture, XML Wrapped Metadata
Integration: APIs Integration, APIs Technical Integration, ISP Integration, IRIS Integration, ImageBank Integration
Enterprise Profile Development and Maintenance
Factiva Metadata Database
Reference Sources
54
Caution Regarding Tools

Not all tools will do what we are describing here
You need to have an underlying semantic engine which can perform semantic analysis
You need to have a semantic engine in multiple languages, because semantics vary by language
You need to have access to the programs through a user-friendly interface so you can adapt them to your environment without having to have programming knowledge
You need to have several different kinds of technologies to do what I'm describing here
Not all the tools on the market today support
this work

55
Impacts and Outcomes

Information Access impacts
Increased precision of search
Better control over recall
Searching like we talk
Exact match searching and known-item searching will work better
Metadata-based searching now begins to resemble full-text searching, but with all the advantages of structure and context, and a significant reduction in the amount of noise
Productivity Improvements
Can now assign deep metadata to all kinds of
content
Remove the human review aspect from the metadata
capture
Reduce unit times where human review is still
used
Information Quality impacts
Apply quality metrics at the metadata level to eliminate the need to build fuzzy search architectures; these rarely scale or improve in performance
Use the technologies to identify and fix problems
with data
