Ontologies, Taxonomies and Search

1
Ontologies, Taxonomies and Search
  • Denise Bedford
  • Special Libraries Association
  • Baltimore, Maryland
  • June, 2006

2
Presentation Overview
  • I will talk today from the end-game perspective
    of search, placing taxonomies and ontologies in
    that context
  • Search User Needs
  • Search Functional Architecture
  • Metadata Categorization
  • Semantic Technologies

3
Search User Needs
4
Basic Assumptions About Search Systems
  • Search systems are not WYSIWYG: what you see is
    not what you get
  • Search systems are more like icebergs: most of
    the search system is below the surface, so you
    can't see it and you don't know how it has been
    configured or what components it has
  • It is helpful to have a framework against which
    you can judge the suitability of any search tool
    to your environment
  • Assess the search environment before you decide
    what kind of a search system you need

5
The Environment
  • Like other libraries, we acquire, create, license,
    and access information in 30 subject domains,
    ranging from Law and Justice, to Transport, to
    Environment, to Agriculture, to Education, Health
    and Nutrition, etc.
  • We have 500 different kinds of content/document
    types
  • We have a set of business processes which
    represent the way we do our work
  • We have six working languages but actually work
    in many more
  • We have a rich history, with content dating back
    to the 1940s
  • Our priorities change over time
  • Everyone in the Bank is a recognized
    international expert

6
Underlying Search Architecture
Silos of information: multiple search engines
and user interfaces
7
Search Environment
  • It is very challenging to design a search system
    which meets the needs of this complex environment
  • Internal experts do known-item searching and
    look for other experts to talk to
  • The general public looks for information about
    something
  • Most people don't know all the dimensions of the
    Bank
  • People need to search in languages other than
    English and find content written in other
    languages
  • Our new Knowledge and Learning initiative is
    moving us towards a W3 working environment:
    what information I need, when I need it, where I
    need it
  • We need a search architecture that supports the
    users' needs

8
Information Architecture: Moving Towards
Contextualization
[Diagram: the 3W Information Model (What I Want, Where
I Want, When I Want) for customized access to
information. Labels: Individualized Access (Context
Sensitive, Individual); Group Access (Context
Sensitive, Group); Anonymous Access (Context Free);
myPortal; Client Connections; Operations Portal;
Intranet; External Web; Content and Information
Assets, ranging from Structured to Less structured;
Basic Information; Customer Information; People]
9
Search Use Case Analysis
  • Why Search Wasn't Working

10
Search Has Always Been Problematic
  • Despite any reports you may read in the general
    literature, buying a fast crawler will not solve
    your search problems
  • Implementing a fast crawler simply surfaces
    information management and data quality
    challenges directly to the users
  • The crawler approach provides high hit rates, and
    very low relevancy rates
  • It was a challenge to understand the search
    problem from the system side
  • Try to understand the search problem from the
    user's side

11
Consultant is looking for the CAS for the Congo.
[Search-path diagram. Labels, in order: Steps 1-2; F;
DEC Consultant; Intranet Search; Steps 4-6; F;
ImageBank Search]
12
Adviser wants to find social indicators, relevant
quotes and the Bank's position on racism, as well as
learn what the Bank has done in this area, in
order to prepare a 30-minute speech for a WB
Managing Director for a conference.
13
What We Learned
  • The absolute success rate per search task was low
    at 43.18%
  • The search experience generally consists of
    multiple steps and multiple searches within and
    across sources
  • The source with the fewest steps was a
    Colleague or Personal Contact. The source with
    the greatest number of steps was External Web
    Search and Browse
  • Each source has its own behavior, business rules,
    and functional architecture, so users need to
    learn each system
  • The Intranet search was selected as a logical
    place to look for information in over 65% of the
    use cases. However, the success rate for the
    Intranet Google search was only 35%

14
What We Learned
  • You need to understand search from the searcher's
    point of view
  • Until you can see it this way, you are simply
    guessing at what problems need to be solved and
    how to solve them
  • This is a challenge for an organization whose
    users are the public, but it is possible to
    design an architecture that suits the general
    public and the internal staff, and the important
    stakeholders
  • The critical success factor is to design an open
    architecture for agility and flexibility

15
Enterprise Search
  • Evolving Search Architecture

16
Search Engine Parts
  • Contextualization, Personalization, Recommender,
    Content Syndication, QA Systems, Intelligent
    Search
  • Search System Inputs (query manipulation): simple
    search, fielded search, query processing
    algorithms, search term assistance (thesaurus,
    dictionaries), search language selection, sources
    selection
  • Query Transformation and Matching: Boolean
    matching, exact phrase matching, fuzzy matching,
    term matching with synonym expansion, term
    matching with cross-language expansion, term
    weighted matching, query term root matching with
    dictionary expansion, wild card matching, term
    proximity matching, neural network expansion,
    genetic algorithm expansion
  • Search Outputs (display results): integrated
    search results, search results sorting, search
    results contextualization, search results
    relevancy ranking
  • Metadata Repository for Enterprise Search
  • Index Architecture and Construction
  • Vocabulary Support (thesaurus)
  • Classification Schemes (taxonomy)
  • Query Processing: transforming the query to add
    tags, synonyms, etc.
  • Diagram layers: what the user sees; the basic
    search programs; the foundation or baseline of
    the search system
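Two of the matching techniques named on this slide, wild card matching and fuzzy matching, can be sketched with the Python standard library. This is only an illustration: the vocabulary, the function names and the 0.8 similarity cutoff are assumptions, not part of the search system described in the talk.

```python
import difflib
from fnmatch import fnmatchcase

# Hypothetical index vocabulary, for illustration only.
VOCABULARY = ["education", "poverty", "property", "transit", "transport"]


def wildcard_match(pattern, vocabulary=VOCABULARY):
    """Wild card matching: '*' stands for any run of characters."""
    return [term for term in vocabulary if fnmatchcase(term, pattern.lower())]


def fuzzy_match(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Fuzzy matching: return vocabulary terms that closely resemble `term`."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


if __name__ == "__main__":
    print(wildcard_match("trans*"))
    print(fuzzy_match("povrty"))
```

A production engine would apply these against its index rather than a Python list, but the sketch shows why a misspelled query need not return zero results.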
17
Enterprise Search Proposed Solution
  • What Enterprise Search is and what it is not
  • Distinguish between the web services platform for
    accessing Enterprise Search and the enterprise's
    full store of content (it's not only electronic,
    and it's not all in HTML format!)
  • What kind of an architecture do you need to
    support enterprise search?
  • What kind of tools do you need to support
    enterprise search? It's much more than just a web
    crawler.
  • What do you need to support true concept-level
    cross-language searching?
  • How does Enterprise Search respect security
    classifications with/without restricting
    knowledge of what exists?

19
Taxonomies and Ontologies in Search
20
Using Taxonomies in Search
  • Controlled vocabularies: guiding users through
    the search process (flat taxonomies)
  • Classification schemes: browsing, navigation,
    syndication, contextualization (hierarchical
    taxonomies)
  • Metadata: supporting fielded search (faceted
    taxonomies)
  • Thesauri: supporting knowledge discovery,
    preventing zero results, supporting
    cross-language searching (network taxonomies)
  • Synonym rings: improving relevancy, reducing
    information scattering, managing recall (ring
    taxonomies)

21
Vision of An Enterprise Advanced Search
Ring taxonomy
Ring taxonomy
Flat taxonomy
Hierarchical taxonomy
Fielded Search Faceted Taxonomy
22
Ring Taxonomy
Metadata
Network Taxonomy
23
Data Driving the Enterprise Architecture
  • 30-Second Description of Taxonomies and Ontologies

24
Taxonomies and Ontologies
  • I just mentioned several kinds of taxonomies that
    support search
  • Faceted taxonomies (metadata schemes)
  • Flat taxonomies (controlled vocabularies,
    picklists)
  • Hierarchical taxonomies (classification schemes)
  • Network taxonomies (thesauri, semantic networks)
  • Ring taxonomies to handle synonym rings and
    synonym searching
  • Let's walk through these kinds of taxonomies
  • Then let's see how they all make up an ontology

25
Facet Taxonomies
A faceted taxonomy is represented as a star data
structure. Each node in the star structure
is linked to the center focus. Any node can be
linked to other nodes in other stars. It appears
simple, but becomes complex quickly.
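A minimal sketch of the star structure, assuming the center focus is a content item and each point of the star is a facet with its values. The class, the facet names and the sample records are hypothetical and only illustrate how faceted metadata supports fielded search.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """Center of the star: one content object and its facet nodes."""
    title: str
    facets: dict = field(default_factory=dict)  # facet name -> set of values


def matches(item, **wanted):
    """True if the item carries every requested facet value."""
    return all(value in item.facets.get(facet, set())
               for facet, value in wanted.items())


def fielded_search(items, **wanted):
    """Return titles of the items that satisfy all facet constraints."""
    return [item.title for item in items if matches(item, **wanted)]


# Hypothetical sample records.
ITEMS = [
    ContentItem("Rural roads review",
                {"topic": {"Transport"}, "language": {"English"}, "doc_type": {"Report"}}),
    ContentItem("Education sector note",
                {"topic": {"Education"}, "language": {"French"}, "doc_type": {"Report"}}),
]

if __name__ == "__main__":
    print(fielded_search(ITEMS, doc_type="Report", language="French"))
```

Because two items can share a facet value such as `Report`, nodes in one star are effectively linked to nodes in other stars, which is where the complexity the slide mentions comes from.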
26
Core Metadata Strategy
  • What is a core metadata strategy?
  • What's the process you use to discover your
    organization's core metadata strategy?
  • Libraries have many core metadata standards:
    COSATI, Dublin Core, MARC, MODS, AIIM TR48
  • The important question for metadata in search is:
    how are you using your metadata to support
    search?
  • Too often we put a dumb search engine on top of
    smart metadata and do nothing more with it than
    publish it
  • It's time to think smarter about how we use our
    metadata

27
Purpose of Core Metadata
Identification/Distinction
Search and Browse
Compliant Document Management
Use Management
28
Capturing Core Metadata
Identification/Distinction
Compliant Document Management
Search and Browse
Use Management
Human Creation
Programmatic Capture
Extrapolate from Business Rules
Inherit from System Context
29
Flat Taxonomy Structure
Energy, Environment, Education,
Economics, Transport, Trade,
Labor, Agriculture
30
Hierarchical Taxonomy
A hierarchical taxonomy is represented as a tree
data structure in a database application. The
tree data structure consists of nodes and
links. In an RDBMS environment, the
relationships become associations. In a
hierarchical taxonomy, a node can have only one
parent.
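The single-parent rule can be sketched as a mapping from each node to its one parent. This is an illustrative in-memory version, not the database representation the slide refers to; the class name and the sample terms are assumptions.

```python
class HierarchicalTaxonomy:
    """Tree of terms in which every node has at most one parent."""

    def __init__(self):
        self.parent = {}  # node -> its single parent (None for a root)

    def add(self, node, parent=None):
        if node in self.parent:
            # A second parent would turn the tree into a network.
            raise ValueError(f"{node!r} is already placed in the taxonomy")
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent {parent!r}")
        self.parent[node] = parent

    def path(self, node):
        """Return the chain of nodes from the root down to `node`."""
        chain = []
        while node is not None:
            chain.append(node)
            node = self.parent[node]
        return list(reversed(chain))

    def children(self, node):
        return sorted(n for n, p in self.parent.items() if p == node)


if __name__ == "__main__":
    tax = HierarchicalTaxonomy()
    tax.add("Topics")
    tax.add("Transport", parent="Topics")
    tax.add("Rural roads", parent="Transport")
    print(tax.path("Rural roads"))
```

The path from root to node is what makes a hierarchical taxonomy useful for browsing and navigation.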
31
Hierarchical Taxonomies: Classification Schemes
  • Ranganathan is the supreme authority on how to
    create a well-designed classification scheme
  • Most of what we teach in graduate school, though,
    is how to use a classification scheme, not how to
    create one
  • Classification schemes are the controlling
    reference source for metadata attributes
  • For example, the Enterprise Topic Classification
    Scheme is the reference source for the attribute
    Topic
  • We use tools to help us discover what the scheme
    should be and also to help us classify content

32
Network Taxonomies
A network taxonomy is a plex data structure.
Each node can have more than one parent. Any
item in a plex structure can be linked to any
other item. In plex structures, links can be
meaningfully different.
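A minimal sketch of a plex structure with typed links, assuming the familiar thesaurus relation codes BT (broader term), NT (narrower term) and RT (related term). The class and the sample terms are hypothetical.

```python
from collections import defaultdict


class NetworkTaxonomy:
    """Plex structure: any term can link to any other, and links carry a type."""

    def __init__(self):
        self.links = defaultdict(set)  # term -> {(relation, other term)}

    def link(self, source, relation, target):
        self.links[source].add((relation, target))

    def related(self, term, relation):
        """Return the terms reachable from `term` through links of one type."""
        return sorted(t for r, t in self.links.get(term, ()) if r == relation)


if __name__ == "__main__":
    net = NetworkTaxonomy()
    # One term with two broader terms, which a tree would not allow.
    net.link("Microfinance", "BT", "Financial services")
    net.link("Microfinance", "BT", "Poverty reduction")
    net.link("Microfinance", "RT", "Small enterprises")
    print(net.related("Microfinance", "BT"))
```

Keeping the relation type on each link is what lets a search tool treat a broader-term hop differently from a related-term hop.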
33
Ring Taxonomy
Poverty mitigation
Poverty alleviation
Poverty reducation
Poverty elimination
Poverty prevention
Poverty reduction
Poverty abatement
Poverty eradication
Rings can include all kinds of synonyms: true
synonyms, misspellings, predecessors, abbreviations
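A synonym ring can be applied at query time by expanding any member of the ring into the whole ring. The sketch below uses the terms from this slide, including the misspelling, which belongs in the ring on purpose; the function name and the lower-casing step are assumptions.

```python
# Ring members taken from the slide; "reducation" is a deliberate
# misspelling entry, since rings may include misspellings.
POVERTY_RING = frozenset({
    "poverty mitigation", "poverty alleviation", "poverty reducation",
    "poverty elimination", "poverty prevention", "poverty reduction",
    "poverty abatement", "poverty eradication",
})

RINGS = [POVERTY_RING]


def expand_query(term, rings=RINGS):
    """Return every member of the ring containing `term`, or the term alone."""
    normalized = term.strip().lower()
    for ring in rings:
        if normalized in ring:
            return sorted(ring)
    return [normalized]


if __name__ == "__main__":
    print(expand_query("Poverty reducation"))
```

All members are treated as equivalent, so a search on any one of them retrieves content indexed under the others, which reduces information scattering.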
34
Ontology Design from 50,000 Feet
[Entity-relationship diagram. Entities: Context,
Business Rule, Topic Class Scheme, Business Process
Scheme, Thesaurus, Content Entity, Content Parts,
Content Elements, Content, Metadata, User, Profile,
Country Names, Region Names, Skill Sets/Competencies,
Standard Statistical Variables, Contextual Matrix
Sensing. Relationship labels: uses, has, has values,
has meaning in, has relationship to, contains,
understood. Legend: Not in progress, In progress,
In Use]
35
Building and Maintaining Taxonomies
  • Moving Towards Automated Metadata Capture

36
Topic data class
3 Oracle Data classes
Subtopic Data Class
Relationships across data classes
37
Building and Maintaining Taxonomies
  • Moving towards automated metadata generation
    means that catalogers shift their effort to
    reviewing the metadata generated and to
    maintaining LCSH and LC classification as part of
    a suite of categorization tools
  • Level of effort shifts to training and developing
    the tools and away from original cataloging and
    metadata capture
  • Continue to work closely with subject experts to
    define the controlled vocabularies and
    classification schemes

38
The problem with metadata
  • Metadata:
  • Is expensive and time-consuming to create
  • Is sometimes subjective and not granular enough
  • Doesn't always address the ways that users and
    systems think about the information it describes
  • May not tell us enough about the information to
    trust it
  • May address only one context: the context for
    which it is created
  • May live in the source application where it was
    created
  • May not be as accessible as the information
    object
  • The future depends on metadata so we have to
    resolve these problems
  • We are resolving the problem using automated
    metadata creation and extraction technologies
  • Metadata extraction uses rules, classifiers and
    grammar engines
  • Metadata creation uses categorization and rules
    engines

39
Smart Use of Technologies
  • Sample structure: Bank Topics Classification
    Scheme (hierarchical taxonomy)
  • Oracle data classes are used to represent the
    Topic Classification Scheme
  • The hierarchical taxonomy is the reference source
    for the attribute Topic
  • It is used for Browse, Search, Content
    Syndication, and Personalization
  • 1st challenge is to architect the hierarchy
    correctly
  • 3 distinct data classes, not a tree structure
    with inheritance
  • Allows you to use the three data classes for
    distinct functions across systems but still
    enforce relationships across the classes

40
What is Teragram?
  • Semantic analysis tools which support concept
    extraction, categorization, summarization and
    pattern matching rules engines
  • Teragram works in 23 languages
  • Use categorization to capture Topics, Business
    Activities, Regions, Sectors, Themes, etc.
  • Use Concept Extraction to capture keywords
  • Use Rules Engine to capture Loan #, Credit #,
    Project ID, Trust Fund #, etc.
  • Use Summarization to generate a gist of the
    content

41
How does semantic analysis work?
42
Automatically Generated Metadata
Metadata is generated using a categorization
profile that reflects the way people organize.
43
Automatically Extracted Metadata
Proper Noun Profile for People Names uses
grammars to find and extract the names of people
referenced in the document.
<?xml version="1.0" encoding="UTF-8"?>
<Proper_Noun_Concept>
  <Source>
    <Source_Type>file</Source_Type>
    <Source_Name>W/Concept Extraction/Media Monitoring Negative Training Set/001B950F2EE8D0B4452570B4003FF816.txt</Source_Name>
  </Source>
  <Profile_Name>PEOPLE_ORG</Profile_Name>
  <keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
  <keyword_count>7</keyword_count>
</Proper_Noun_Concept>
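A downstream system could read this XML-wrapped metadata with a standard parser. The sketch below assumes the output is well-formed XML using the element names on this slide; the function name is hypothetical, and the comma-splitting rule is an assumption about how the `keywords` element is delimited.

```python
import xml.etree.ElementTree as ET

# Sample output reconstructed from the slide, as bytes so the
# encoding declaration is honored by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<Proper_Noun_Concept>
  <Source>
    <Source_Type>file</Source_Type>
    <Source_Name>W/Concept Extraction/Media Monitoring Negative Training Set/001B950F2EE8D0B4452570B4003FF816.txt</Source_Name>
  </Source>
  <Profile_Name>PEOPLE_ORG</Profile_Name>
  <keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri, </keywords>
  <keyword_count>7</keyword_count>
</Proper_Noun_Concept>
"""


def read_extracted_names(xml_bytes):
    """Return (names, declared_count) from one extraction result."""
    root = ET.fromstring(xml_bytes)
    raw = root.findtext("keywords", default="")
    # Assumption: names are comma-separated; drop the empty trailing entry.
    names = [name.strip() for name in raw.split(",") if name.strip()]
    declared = int(root.findtext("keyword_count", default="0"))
    return names, declared


if __name__ == "__main__":
    names, declared = read_extracted_names(SAMPLE)
    print(declared, names)
```

Comparing the parsed list against `keyword_count` is a cheap consistency check before the names are loaded into a metadata repository.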
44
Semantic Analysis Basics
  • Once you have made some sense of the sentence,
    reconstruct entities for information extraction
    (compose)
  • Identify names and other fixed-form expressions:
    people, organizations, conferences
  • Identify basic noun groups, verb groups,
    prepositions, other grammatical elements
  • Construct complex noun groups and verb groups
  • Identify event structures
  • Identify common elements and associate

45
Categorizing Content
  • Let's look at how we're categorizing our content
    to this structure automatically
  • Topic classification, geographical region
    assignment, keywording examples
  • Can apply this approach to any kind of content
  • Enables us to build a robust metadata repository
    model, with strong metadata quality, to move
    towards semantic interoperability (SI) at the
    functional level
  • Also note that we can do this across many
    languages

46
Leveraging the Structure
  • Each subtopic is a knowledge domain (hierarchical
    taxonomy) and each subtopic has an extensive
    concept-level definition (1,000 to 5,000
    concepts)
  • Concepts are controlled vocabularies in their raw
    form (flat taxonomy)
  • Concepts with relationships (extensive per the
    new Z39.19 standard) comprise a semantic network
    (network taxonomy)
  • Categorization tools work with the topic
    structure and concept definitions to categorize
    and index content
  • The following screen illustrates how that same
    structure is embedded into a Teragram profile to
    support categorization
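The idea of categorizing content against per-subtopic concept definitions can be sketched as simple concept counting. This is a deliberately naive stand-in, not how Teragram works: the profile contents, the function name and the `min_hits` threshold are all assumptions, and real definitions run to thousands of concepts per subtopic.

```python
import re

# Hypothetical, tiny concept definitions for two subtopics.
PROFILE = {
    "Transport": {"road", "rail", "freight", "port"},
    "Education": {"school", "teacher", "curriculum", "literacy"},
}


def categorize(text, profile=PROFILE, min_hits=2):
    """Return subtopics whose concepts occur at least `min_hits` times."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {}
    for subtopic, concepts in profile.items():
        hits = sum(1 for word in words if word in concepts)
        if hits >= min_hits:
            scores[subtopic] = hits
    # Strongest match first; ties broken alphabetically.
    return sorted(scores, key=lambda s: (-scores[s], s))


if __name__ == "__main__":
    sample = ("The project upgrades rural road links and rail freight "
              "corridors near the school.")
    print(categorize(sample))
```

The threshold keeps a passing mention, such as the single word "school" in the sample, from assigning a topic.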

47
Subtopics
Domain concepts or controlled vocabulary
48
Extensive operators allow us to write grammatical
rules to manage typical semantic problems
49
Concept based rules engine allows us to define
patterns to capture other kinds of data
50
Example of use of Authority Control to capture
country names but extract authorized version of
country name
Example of use of a gazetteer concept
extraction rules engine to support semantic
interoperability
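Authority control of this kind can be sketched as a lookup from name variants to one authorized form. The authority entries below are hypothetical examples, not the Bank's actual reference source, and the function name is an assumption.

```python
import re

# Hypothetical authority file: authorized form -> accepted variants.
AUTHORITY_FILE = {
    "Congo, Democratic Republic of": [
        "DRC", "Democratic Republic of the Congo", "Congo-Kinshasa",
    ],
    "Cote d'Ivoire": ["Ivory Coast", "Cote d'Ivoire"],
}


def authorized_countries(text, authority=AUTHORITY_FILE):
    """Find any variant in the text and report the authorized name instead."""
    found = set()
    for authorized, variants in authority.items():
        for variant in variants:
            pattern = r"\b" + re.escape(variant) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(authorized)
                break
    return sorted(found)


if __name__ == "__main__":
    print(authorized_countries("Field visits covered the Ivory Coast and the DRC."))
```

Because every system receives the same authorized value regardless of how the country was written in the document, the metadata can be compared across systems.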
51
Use of concept extraction rules engine to
capture Loan #, Credit #, Project ID
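Pattern-based capture of identifiers can be sketched with regular expressions. The identifier formats below (a "P" followed by six digits, and four- or five-digit loan and credit numbers introduced by "No." or "#") are assumptions made for illustration; the slide does not state the real formats or the rule syntax of the engine.

```python
import re

# Hypothetical identifier patterns, for illustration only.
RULES = {
    "project_id": re.compile(r"\bP\d{6}\b"),
    "loan_number": re.compile(r"\bLoan\s+(?:No\.|#)\s*(\d{4,5})\b"),
    "credit_number": re.compile(r"\bCredit\s+(?:No\.|#)\s*(\d{4,5})\b"),
}


def extract_identifiers(text, rules=RULES):
    """Apply every rule and collect the identifiers it captures."""
    results = {}
    for field_name, pattern in rules.items():
        values = []
        for match in pattern.finditer(text):
            # Use the capture group when the rule defines one.
            values.append(match.group(1) if pattern.groups else match.group(0))
        if values:
            results[field_name] = values
    return results


if __name__ == "__main__":
    sample = "Project P123456 is financed under Loan No. 4567 and Credit #8910."
    print(extract_identifiers(sample))
```

Each rule maps to one metadata attribute, so adding a new identifier type means adding one pattern rather than changing the extraction logic.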
52
Enterprise Profile Creation and Maintenance
  • Enterprise Metadata Profile
  • Concept Extraction Technology: Country,
    Organization Name, People Name, Series
    Name/Collection Title, Author/Creator, Title,
    Publisher, Standard Statistical Variable,
    Version/Edition
  • Categorization Technology: Topic Categorization,
    Business Function Categorization, Region
    Categorization, Sector Categorization, Theme
    Categorization

[Diagram: Enterprise Profile Development and
Maintenance. Labels: UCM Service Requests; Update
Change Requests; Data Governance Process for Topics,
Business Function, Country, Region, Keywords, People,
Organizations, Project ID; e-CDS Reference Sources
for Country, Region, Topics, Business Function,
Keywords, Project ID, People, Organization; TK240
Client; Teragram Team; Systems 1 through 5]
53
Enterprise Metadata Capture Functional
Reference Model
[Diagram labels: Content Owners; Dedicated Server
running the Teragram Semantic Engine (Concept
Extraction, Categorization, Clustering, Rule Based
Engine, Language Detection); APIs Integration; APIs
Technical Integration; ISP Integration; IRIS
Integration; ImageBank Integration; Factiva Metadata
Database; Functional Team; Business Analyst;
Enterprise Metadata Capture Strategy; TK240 Client
XML Output; Content Capture; XML Wrapped Metadata;
Enterprise Profile Development and Maintenance;
Reference Sources; Indexers; Librarians]
54
Caution Regarding Tools
  • Not all tools will do what we are describing here
  • You need to have an underlying semantic engine
    which can perform semantic analysis
  • You need to have a semantic engine in multiple
    languages, because semantics vary by language
  • You need to have access to the programs through a
    user-friendly interface so you can adapt them to
    your environment without having to have
    programming knowledge
  • You need to have several different kinds of
    technologies to do what I'm describing here
  • Not all the tools on the market today support
    this work

55
Impacts and Outcomes
  • Information Access impacts
  • Increased precision of search
  • Better control over recall
  • Searching like we talk
  • Exact-match searching and known-item searching
    will work better
  • Metadata-based searching now begins to resemble
    full-text searching but with all the advantages
    of structure and context, and a significant
    reduction in the amount of noise
  • Productivity Improvements
  • Can now assign deep metadata to all kinds of
    content
  • Remove the human review aspect from the metadata
    capture
  • Reduce unit times where human review is still
    used
  • Information Quality impacts
  • Apply quality metrics at the metadata level to
    eliminate the need to build fuzzy search
    architectures; these rarely scale or improve in
    performance
  • Use the technologies to identify and fix problems
    with data

56
Thank You.
  • Questions and Discussions
  • Contact Information
  • dbedford_at_worldbank.org
  • db233_at_georgetown.edu