Title: Ontologies, Taxonomies and Search
1Ontologies, Taxonomies and Search
- Denise Bedford
- Special Libraries Association
- Baltimore, Maryland
- June, 2006
2Presentation Overview
- I will talk today from the end-game perspective of search, placing taxonomies and ontologies in that context
- Search User Needs
- Search Functional Architecture
- Metadata Categorization
- Semantic Technologies
3Search User Needs
4Basic Assumptions About Search Systems
- Search systems are not WYSIWYG: what you see is not what you get
- Search systems are more like icebergs: most of the search system is below the surface. You can't see it, and you don't know how it has been configured or what components it has
- It is helpful to have a framework against which you can judge the suitability of any search tool to your environment
- Assess the search environment before you decide what kind of search system you need
5The Environment
- Like other libraries, we acquire, create, license, and access information in 30 subject domains, ranging from Law and Justice to Transport, Environment, Agriculture, Education, Health and Nutrition, etc.
- We have 500 different kinds of content/document types
- We have a set of business processes which represent the way we do our work
- We have six working languages but actually work in many more
- We have rich historical content dating back to the 1940s
- Our priorities change over time
- Everyone in the Bank is a recognized international expert
6Underlying Search Architecture
Silos of Information Multiple search engines
and user interfaces
7Search Environment
- It is very challenging to design a search system which meets the needs of this complex environment
- Internal experts do known-item searching and look for other experts to talk to
- The general public looks for information about something
- Most people don't know all the dimensions of the Bank
- People need to search in languages other than English and find content written in other languages
- Our new Knowledge and Learning initiative is moving us towards a W3 working environment: what information I need, when I need it, where I need it
- We need a search architecture that supports the users' needs
8Information Architecture Moving Towards
Contextualization
3W Information Model: What I Want, Where I Want, When I Want
Customized Access to Information
myPortal Client Connections
Operations Portal myPortal
Individualized Access (Context Sensitive
Individual)
Structured
Intranet External Web
Group Access (Context Sensitive Group)
Anonymous Access (Context Free)
Content Information Assets
Less structured
Basic Information
Customer Information
People
9Search Use Case Analysis
10Search Has Always Been Problematic
- Despite any reports you may read in the general literature, buying a fast crawler will not solve your search problems
- Implementing a fast crawler simply surfaces information management and data quality challenges directly to the users
- The crawler approach provides high hit rates and very low relevancy rates
- It was a challenge to understand the search problem from the system side
- Try to understand the search problem from the user's side
11Consultant is looking for the CAS for the Congo.
Steps 1 - 2
F
DEC Consultant
Intranet Search
Steps 4-6
F
ImageBank Search
12Adviser wants to find social indicators, relevant quotes and the Bank's position on racism, as well as learn what the Bank has done in this area, in order to prepare a 30-minute speech for a WB Managing Director for a conference.
13What We Learned
- The absolute success rate per search task was low, at 43.18%
- The search experience generally consists of multiple steps and multiple searches within and across sources
- The source with the fewest steps was a Colleague or Personal Contact. The source with the greatest number of steps was External Web Search and Browse
- Each source has its own behavior, business rules, and functional architecture, so users need to learn each system
- The Intranet search was selected as a logical place to look for information in over 65% of the use cases. However, the success rate for the Intranet Google search was only 35%
14What We Learned
- You need to understand search from the searcher's point of view
- Until you can see it this way, you are simply guessing at what problems need to be solved and how to solve them
- This is a challenge for an organization whose users are the public, but it is possible to design an architecture that suits the general public, the internal staff, and the important stakeholders
- The critical success factor is to design an open architecture for agility and flexibility
15Enterprise Search
- Evolving Search Architecture
16Search Engine Parts
Contextualization, Personalization, Recommender,
Content Syndication, QA Systems, Intelligent
Search
Search System Inputs
Query Transformation Matching
Search Outputs
Query Manipulation; Simple search; Fielded search; Query processing algorithms; Search term assistance (thesaurus, dictionaries); Search language selection; Sources selection
Boolean matching; Exact phrase matching; Fuzzy matching; Term matching: synonym expansion; Term matching: cross-language expansion; Term weighted matching; Query term root matching: dictionary expansion; Wild card matching; Term proximity matching; Neural network expansion; Genetic algorithm expansion
Metadata Repository for Enterprise Search
Index Architecture Construction
Display Results; Integrated search results; Search results sorting; Search results contextualization; Search results relevancy ranking
Vocabulary Support (thesaurus)
Classification Schemes (taxonomy)
- Query Processing: transforming the query to add tags, synonyms, etc.
What the user sees
The basic search programs
Foundation or Baseline of The Search System
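The three stages named on this slide (inputs, query transformation and matching, outputs) can be sketched roughly as follows. This is only an illustration, not the Bank's architecture: the thesaurus entries, the three sample documents and all function names are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical thesaurus used for search term assistance (not from the slides).
THESAURUS = {"poverty reduction": ["poverty alleviation", "poverty eradication"]}

# Hypothetical mini-collection standing in for the index.
DOCUMENTS = {
    "doc1": "strategies for poverty alleviation in rural areas",
    "doc2": "transport sector loan report",
    "doc3": "poverty reduction and poverty eradication programs",
}


def transform_query(query):
    """Query transformation: normalize the query and add its synonyms."""
    term = query.strip().lower()
    return [term] + THESAURUS.get(term, [])


def match(terms, documents):
    """Exact phrase matching: count how many query variants each document contains."""
    hits = Counter()
    for doc_id, text in documents.items():
        for term in terms:
            if term in text.lower():
                hits[doc_id] += 1
    return hits


def search(query, documents=DOCUMENTS):
    """Search output: document ids ranked by number of matched variants."""
    hits = match(transform_query(query), documents)
    return [doc_id for doc_id, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]
```

Without the synonym expansion step, a query for "poverty reduction" would miss doc1 entirely, which is the kind of recall problem the vocabulary support layer is meant to address.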
17Enterprise Search Proposed Solution
- What Enterprise Search is and what it is not
- Distinguish between the web services platform for accessing Enterprise Search and the enterprise's full store of content (it's not only electronic, and it's not all in HTML format!)
- What kind of architecture do you need to support enterprise search?
- What kind of tools do you need to support enterprise search? It's much more than just a web crawler.
- What do you need to support true concept-level cross-language searching?
- How does Enterprise Search respect security classifications with/without restricting knowledge of what exists?
19Taxonomies and Ontologies in Search
20Using Taxonomies in Search
- Controlled vocabularies: guiding users through the search process (flat taxonomies)
- Classification schemes: browsing, navigation, syndication, contextualization (hierarchical taxonomies)
- Metadata: supporting fielded search (faceted taxonomies)
- Thesauri: supporting knowledge discovery, preventing zero results, supporting cross-language searching (network taxonomies)
- Synonym rings: improving relevancy, reducing information scattering, managing recall (ring taxonomies)
21Vision of An Enterprise Advanced Search
Ring taxonomy
Ring taxonomy
Flat taxonomy
Hierarchical taxonomy
Fielded Search Faceted Taxonomy
22Ring Taxonomy
Metadata
Network Taxonomy
23Data Driving the Enterprise Architecture
- 30 Second Description of Taxonomies and Ontologies
24Taxonomies and Ontologies
- I just mentioned several kinds of taxonomies used in search
- Faceted taxonomies (metadata schemes)
- Flat taxonomies (controlled vocabularies, picklists)
- Hierarchical taxonomies (classification schemes)
- Network taxonomies (thesauri, semantic networks)
- Ring taxonomies to handle synonym rings and synonym searching
- Let's walk through these kinds of taxonomies
- Then let's see how they all make up an ontology
25Facet Taxonomies
A faceted taxonomy is represented as a star data structure. Each node in the star structure is linked to the center focus. Any node can be linked to other nodes in other stars. It appears simple, but becomes complex quickly.
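A minimal sketch of the star structure, assuming each content item is the center of its own star and each facet value is a point on it. The facet names, sample records and helper functions are hypothetical, chosen only to show how fielded search and cross-star links fall out of the structure.

```python
from dataclasses import dataclass, field


@dataclass
class FacetedRecord:
    """Center of one star: a content item with its facet values as the points."""
    identifier: str
    facets: dict = field(default_factory=dict)  # facet name -> list of values


# Hypothetical sample records.
RECORDS = [
    FacetedRecord("doc-1", {"Topic": ["Transport"], "Region": ["Africa"], "Language": ["French"]}),
    FacetedRecord("doc-2", {"Topic": ["Transport", "Energy"], "Region": ["South Asia"], "Language": ["English"]}),
]


def fielded_search(records, **criteria):
    """Return identifiers of records whose facets match every criterion."""
    return [
        r.identifier
        for r in records
        if all(value in r.facets.get(facet, []) for facet, value in criteria.items())
    ]


def shared_values(a, b):
    """Facet values that link one star to another."""
    return {(f, v) for f, vs in a.facets.items() for v in vs if v in b.facets.get(f, [])}
```

Two records are enough to see the links between stars appear; with hundreds of content types and dozens of facets, the number of such links grows quickly.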
26Core Metadata Strategy
- What is a core metadata strategy?
- What's the process you use to discover your organization's core metadata strategy?
- Libraries have many core metadata standards: COSATI, Dublin Core, MARC, MODS, AIIM TR48
- The important question for metadata in search is: how are you using your metadata to support search?
- Too often we put a dumb search engine on top of smart metadata and do nothing more with it than publish it
- It's time to think smarter about how we use our metadata
27Purpose of Core Metadata
Identification/ Distinction
Search Browse
Compliant Document Management
Use Management
28Capturing Core Metadata
Identification/ Distinction
Compliant Document Management
Search Browse
Use Management
Human Creation
Programmatic Capture
Extrapolate from Business Rules
Inherit from System Context
29Flat Taxonomy Structure
Energy Environment Education
Economics Transport Trade
Labor Agriculture
30Hierarchical Taxonomy
A hierarchical taxonomy is represented as a tree
data structure in a database application. The
tree data structure consists of nodes and
links. In an RDBMS environment, the
relationships become associations. In a
hierarchical taxonomy, a node can have only one
parent.
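The single-parent rule can be sketched as a small class that stores one parent per node. The class, its method names and the sample terms are illustrative assumptions, not the Bank's implementation.

```python
class HierarchicalTaxonomy:
    """Tree of nodes in which every node has at most one parent."""

    def __init__(self):
        self.parent = {}  # node -> parent node (None for the root)

    def add(self, node, parent=None):
        """Add a node; reject a second parent and unknown parents."""
        if node in self.parent:
            raise ValueError(f"{node!r} is already in the taxonomy")
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent {parent!r}")
        self.parent[node] = parent

    def path(self, node):
        """Return the chain of nodes from the root down to the given node."""
        trail = []
        while node is not None:
            trail.append(node)
            node = self.parent[node]
        return list(reversed(trail))

    def children(self, node):
        """Return the direct children of a node, sorted by name."""
        return sorted(n for n, p in self.parent.items() if p == node)


def sample_taxonomy():
    """Build a tiny hypothetical topic tree."""
    tree = HierarchicalTaxonomy()
    tree.add("Topics")
    tree.add("Energy", "Topics")
    tree.add("Transport", "Topics")
    tree.add("Renewable Energy", "Energy")
    return tree
```

Because each node has exactly one parent, every node has exactly one path from the root, which is what makes this structure suitable for browsing and navigation.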
31Hierarchical Taxonomies Classification Schemes
- Ranganathan is the supreme authority on how to create a well-designed classification scheme
- Most of what we teach in graduate school, though, is how to use a classification scheme, not how to create one
- Classification schemes are the controlling reference source for metadata attributes
- For example, the Enterprise Topic Classification Scheme is the reference source for the attribute Topic
- We use tools to help us discover what the scheme should be and also to help us classify content
32Network Taxonomies
A network taxonomy is a plex data structure.
Each node can have more than one parent. Any
item in a plex structure can be linked to any
other item. In plex structures, links can be
meaningfully different.
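One way to picture a plex structure is a graph whose links carry a relationship type. The sketch below assumes the broader, narrower and related term labels (BT, NT, RT) commonly used in thesauri; the class and the sample terms are hypothetical.

```python
from collections import defaultdict


class NetworkTaxonomy:
    """Graph of terms in which every link carries a relationship type."""

    def __init__(self):
        self.links = defaultdict(set)  # term -> {(relation, other term)}

    def link(self, source, relation, target, inverse):
        """Record a typed link and its inverse, e.g. BT one way and NT the other."""
        self.links[source].add((relation, target))
        self.links[target].add((inverse, source))

    def related(self, term, relation):
        """Return the terms reachable from `term` through one kind of link."""
        return sorted(t for r, t in self.links.get(term, ()) if r == relation)


def sample_network():
    """Build a tiny hypothetical thesaurus fragment."""
    net = NetworkTaxonomy()
    net.link("Microfinance", "BT", "Finance", inverse="NT")
    net.link("Microfinance", "BT", "Poverty reduction", inverse="NT")
    net.link("Microfinance", "RT", "Small enterprises", inverse="RT")
    return net
```

Here "Microfinance" has two broader terms, which a hierarchical taxonomy would not allow, and its related-term link means something different from its broader-term links.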
33Ring Taxonomy
Poverty mitigation
Poverty alleviation
Poverty reducation
Poverty elimination
Poverty prevention
Poverty reduction
Poverty abatement
Poverty eradication
Rings can include all kinds of synonyms: true synonyms, misspellings, predecessors, abbreviations
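A synonym ring can be sketched as a set of equivalent terms plus an index from each member to the whole ring. The ring below uses the eight terms from this slide, including the deliberate misspelling; the function names are assumptions for illustration.

```python
# The ring from this slide; "poverty reducation" is an intentional misspelling member.
POVERTY_RING = {
    "poverty mitigation",
    "poverty alleviation",
    "poverty reducation",
    "poverty elimination",
    "poverty prevention",
    "poverty reduction",
    "poverty abatement",
    "poverty eradication",
}


def build_ring_index(rings):
    """Map every member term to the full ring it belongs to."""
    index = {}
    for ring in rings:
        members = frozenset(term.lower() for term in ring)
        for term in members:
            index[term] = members
    return index


def expand(query, index):
    """Replace a query term with all members of its ring, if it has one."""
    term = query.strip().lower()
    return sorted(index.get(term, {term}))


RING_INDEX = build_ring_index([POVERTY_RING])
```

A search for any member, even the misspelled one, is expanded to the whole ring, which is how rings reduce information scattering; no member is preferred over another.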
34Ontology Design from 50,000 Feet
uses
Contextual Matrix Sensing
Understood
Context
Business Rule
Has
Topic Class Scheme
Has Meaning in
Content Entity1
User
Business Process Scheme
Has values
Has relationship to
Thesaurus
Has
Has
Metadata
uses
Content Parts
Country Names
Profile
Has
Region Names
Content Elements
Has
Metadata
Skill Sets/ Competencies
Contains
Has values
Content
Standard Statistical Variables
Not in progress
In progress
In Use
35Building and Maintaining Taxonomies
- Moving Towards Automated Metadata Capture
36Topic data class
3 Oracle Data classes
Subtopic Data Class
Relationships across data classes
37Building and Maintaining Taxonomies
- Moving towards automated metadata generation means that catalogers shift their effort to reviewing the metadata generated and to maintaining LCSH and LC classification as part of a suite of categorization tools
- Level of effort shifts to training and developing the tools and away from original cataloging and metadata capture
- Continue to work closely with subject experts to define the controlled vocabularies and classification schemes
38The problem with metadata
- Metadata...
- Is expensive and time-consuming to create
- Is sometimes subjective and not granular enough
- Doesn't always address the ways that users and systems think about the information it describes
- May not tell us enough about the information to trust it
- May address only one context: the context for which it is created
- May live in the source application where it was created
- May not be as accessible as the information object
- The future depends on metadata, so we have to resolve these problems
- We are resolving the problem using automated metadata creation and extraction technologies
- Metadata extraction uses rules, classifiers and grammar engines
- Metadata creation uses categorization and rules engines
39Smart Use of Technologies
- Sample structure: Bank Topics Classification Scheme (hierarchical taxonomy)
- Oracle data classes used to represent the Topic Classification Scheme
- Hierarchical taxonomy as reference source for the attribute Topic
- Used for Browse, Search, Content Syndication, Personalization
- 1st challenge is to architect the hierarchy correctly
- 3 distinct data classes, not a tree structure with inheritance
- Allows you to use the three data classes for distinct functions across systems but still enforce relationships across the classes
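The idea of three distinct data classes with enforced relationships can be sketched with three tables and foreign keys. This sketch uses SQLite from Python rather than Oracle, and the table and column names are assumptions; the slides name the Topic and Subtopic classes, while "concept" as the third class is a guess made for illustration.

```python
import sqlite3


def build_schema():
    """Create three separate tables whose relationships are enforced by foreign keys."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
    conn.executescript(
        """
        CREATE TABLE topic (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE subtopic (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            topic_id INTEGER NOT NULL REFERENCES topic(id)
        );
        CREATE TABLE concept (  -- hypothetical third class
            id INTEGER PRIMARY KEY,
            label TEXT NOT NULL,
            subtopic_id INTEGER NOT NULL REFERENCES subtopic(id)
        );
        """
    )
    return conn


def is_rejected(conn, sql):
    """Return True if the statement violates an enforced relationship."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False
```

Each table can be queried on its own, for example a flat list of subtopics for a picklist, while the foreign keys keep a subtopic from pointing at a topic that does not exist.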
40What is Teragram?
- Semantic analysis tools which support concept extraction, categorization, summarization and pattern-matching rules engines
- Teragram works in 23 languages
- Use categorization to capture Topics, Business Activities, Regions, Sectors, Themes, etc.
- Use Concept Extraction to capture keywords
- Use Rules Engine to capture Loan #, Credit #, Project ID, Trust Fund #, etc.
- Use Summarization to generate a gist of the content
41How does semantic analysis work?
42Automatically Generated Metadata
Metadata is generated using a categorization
profile that reflects the way people organize.
43Automatically Extracted Metadata
Proper Noun Profile for People Names uses
grammars to find and extract the names of people
referenced in the document.
<?xml version="1.0" encoding="UTF-8"?>
<Proper_Noun_Concept>
<Source><Source_Type>file</Source_Type>
<Source_Name>W/Concept Extraction/Media Monitoring Negative Training Set/001B950F2EE8D0B4452570B4003FF816.txt</Source_Name>
</Source>
<Profile_Name>PEOPLE_ORG</Profile_Name>
<keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
<keyword_count>7</keyword_count>
</Proper_Noun_Concept>
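Downstream systems would need to read this XML-wrapped metadata. A minimal sketch using Python's standard library, assuming the element names shown on this slide; the file path in the sample is shortened and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the slide's output; the source file name is a shortened placeholder.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<Proper_Noun_Concept>
<Source><Source_Type>file</Source_Type>
<Source_Name>sample.txt</Source_Name>
</Source>
<Profile_Name>PEOPLE_ORG</Profile_Name>
<keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
<keyword_count>7</keyword_count>
</Proper_Noun_Concept>"""


def parse_extraction(xml_bytes):
    """Return the profile name, the extracted names and the declared count."""
    root = ET.fromstring(xml_bytes)
    raw = root.findtext("keywords", default="")
    # The keyword list is comma separated and ends with a trailing comma.
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return {
        "profile": root.findtext("Profile_Name"),
        "names": names,
        "declared_count": int(root.findtext("keyword_count", default="0")),
    }
```

Comparing the declared count with the number of parsed names is a cheap consistency check before the values are loaded into a metadata repository.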
44Semantic Analysis Basics
- Once you have made some sense of the sentence, reconstruct entities for information extraction (compose)
- Identify names and other fixed-form expressions: people, organizations, conferences
- Identify basic noun groups, verb groups, prepositions, other grammatical elements
- Construct complex noun groups and verb groups
- Identify event structures
- Identify common elements and associate
45Categorizing Content
- Let's look at how we're categorizing our content to this structure automatically
- Topic classification, geographical region assignment, keywording examples
- Can apply this approach to any kind of content
- Enables us to build a robust metadata repository model, with strong metadata quality, to move towards semantic interoperability (SI) at the functional level
- Also note that we can do this across many languages
46Leveraging the Structure
- Each subtopic is a knowledge domain (hierarchical taxonomy) and each subtopic has an extensive concept-level definition (1,000 to 5,000 concepts)
- Concepts are controlled vocabularies in their raw form (flat taxonomy)
- Concepts with relationships (extensive per new Z39.19 standard) comprise a semantic network (network taxonomy)
- Categorization tools work with the topic structure and concept definitions to categorize and index content
- The following screen illustrates how that same structure is embedded into a Teragram profile to support categorization
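As a rough illustration of categorizing against concept definitions, the sketch below scores a text by how many of each subtopic's concepts it mentions. It is not how Teragram works internally; the profile contents, threshold and function name are all assumptions, and real definitions would hold thousands of concepts per subtopic.

```python
# Hypothetical concept definitions; real profiles hold 1,000 to 5,000 concepts per subtopic.
PROFILE = {
    "Energy": {"electricity", "power grid", "renewable energy"},
    "Transport": {"roads", "railways", "urban transit"},
}


def categorize(text, profile, threshold=1):
    """Return subtopics whose concept definitions match the text, best match first."""
    lowered = text.lower()
    scores = {
        subtopic: sum(1 for concept in concepts if concept in lowered)
        for subtopic, concepts in profile.items()
    }
    matched = [subtopic for subtopic, score in scores.items() if score >= threshold]
    return sorted(matched, key=lambda subtopic: (-scores[subtopic], subtopic))
```

Plain substring matching is the weakest part of this sketch (it would find "roads" inside "crossroads"), which is one reason a production tool relies on grammatical rules and operators instead.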
47Subtopics
Domain concepts or controlled vocabulary
48Extensive operators allow us to write grammatical
rules to manage typical semantic problems
49Concept based rules engine allows us to define
patterns to capture other kinds of data
50Example of use of Authority Control to capture
country names but extract authorized version of
country name
Example of use of a gazetteer concept
extraction rules engine to support semantic
interoperability
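Authority control of this kind can be sketched as a lookup from name variants to one authorized form. The authority entries below are hypothetical examples, not the Bank's reference data, and the function names are made up for the illustration.

```python
import re

# Hypothetical authority file: authorized form -> known variants.
VARIANTS = {
    "Congo, Democratic Republic of": ["Democratic Republic of Congo", "DRC", "Zaire"],
    "Côte d'Ivoire": ["Ivory Coast", "Cote d'Ivoire"],
}

# Reverse lookup: every variant (and the authorized form itself) -> authorized form.
LOOKUP = {
    name.lower(): authorized
    for authorized, variants in VARIANTS.items()
    for name in [authorized, *variants]
}

# Longest names first so a long variant wins over a shorter one it contains.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in sorted(LOOKUP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def extract_countries(text):
    """Find country name variants in the text and return their authorized forms."""
    return sorted({LOOKUP[m.group(1).lower()] for m in PATTERN.finditer(text)})
```

Whatever variant an author used, the metadata record carries the same authorized value, which is what lets separate systems exchange and compare country metadata.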
51Use of concept extraction rules engine to
capture Loan #, Credit #, Project ID
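Pattern rules for identifiers can be sketched with regular expressions. The formats below (a "P" followed by six digits for a project ID, four or five digits after "Loan No." or "Credit #") are assumptions made for the example, not the rules actually used.

```python
import re

# Hypothetical identifier formats, assumed for illustration only.
RULES = {
    "project_id": re.compile(r"\bP\d{6}\b"),
    "loan_number": re.compile(r"\bLoan\s+(?:No\.?|#)\s*(\d{4,5})\b", re.IGNORECASE),
    "credit_number": re.compile(r"\bCredit\s+(?:No\.?|#)\s*(\d{4,5})\b", re.IGNORECASE),
}


def apply_rules(text, rules=RULES):
    """Run every rule over the text and collect the captured identifiers."""
    results = {}
    for name, pattern in rules.items():
        # Use the capture group when the rule has one, otherwise the whole match.
        results[name] = [m.group(m.lastindex or 0) for m in pattern.finditer(text)]
    return results
```

Each rule is independent of the others, so new identifier types such as trust fund numbers could be added as extra entries without touching the existing ones.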
52Enterprise Profile Creation and Maintenance
- Enterprise Metadata Profile
- Concept Extraction Technology
- Country
- Organization Name
- People Name
- Series Name/Collection Title
- Author/Creator
- Title
- Publisher
- Standard Statistical Variable
- Version/Edition
- Categorization Technology
- Topic Categorization
- Business Function Categorization
- Region Categorization
- Sector Categorization
- Theme Categorization
UCM Service Requests
Update Change Requests
Data Governance Process for Topics, Business
Function, Country, Region, Keywords, People,
Organizations, Project ID
e-CDS Reference Sources for Country, Region,
Topics Business Function, Keywords, Project ID,
People, Organization
Enterprise Profile Development Maintenance
System 5
TK240 Client
Teragram Team
System 4
System 1
System 2
System 3
53Content Owners
Content Owners
Dedicated Server Teragram Semantic Engine
Concept Extraction, Categorization, Clustering,
Rule Based Engine, Language Detection
APIs Integration
APIs Integration
ISP Integration
Functional Team
IRIS Integration
Business Analyst
Enterprise Metadata Capture Strategy TK240
Client XML Output
Content Capture
Content Capture
XML Wrapped Metadata
XML Wrapped Metadata
APIs Integration
APIs Technical Integration
Enterprise Profile Development Maintenance
Factiva Metadata Database
ImageBank Integration
Reference Sources
Indexers
Librarians
Enterprise Metadata Capture Functional
Reference Model
54Caution Regarding Tools
- Not all tools will do what we are describing here
- You need to have an underlying semantic engine which can perform semantic analysis
- You need to have a semantic engine in multiple languages, because semantics vary by language
- You need to have access to the programs through a user-friendly interface so you can adapt them to your environment without having to have programming knowledge
- You need to have several different kinds of technologies to do what I'm describing here
- Not all the tools on the market today support this work
55Impacts and Outcomes
- Information Access impacts
- Increased precision of search
- Better control over recall
- Searching like we talk
- Exact-match searching and known-item searching will work better
- Metadata-based searching now begins to resemble full-text searching but with all the advantages of structure and context, and a significant reduction in the amount of noise
- Productivity Improvements
- Can now assign deep metadata to all kinds of content
- Remove the human review aspect from the metadata capture
- Reduce unit times where human review is still used
- Information Quality impacts
- Apply quality metrics at the metadata level to eliminate the need to build fuzzy search architectures, which rarely scale or improve in performance
- Use the technologies to identify and fix problems with data
56Thank You.
- Questions Discussions
- Contact Information
- dbedford_at_worldbank.org
- db233_at_georgetown.edu