Title: Taxonomies: Hidden but Critical Tools


1
Taxonomies: Hidden but Critical Tools
  • Marjorie M.K. Hlava
  • President
  • Access Innovations, Inc.

2
Industry in change
  • Technology changes
  • Evolving standards
  • Mergers
  • New buzzwords
  • Hard to tell what is real

3
Popular Misconceptions
  • Computers can do it all
  • No need to index
  • No need for thesauri or subject headings
  • Full text gives all we need
  • Automatic full text
  • User friendly search engines
  • Search engines are indexes
  • User profiles provide the right context
  • Data filters give right answers

4
Some of it is true
  • What can we use?
  • Automatic - semi - classification
  • Depends...
  • Size of collection
  • Cost of the effort

5
What's in?
  • Taxonomies
  • thesauri
  • hierarchies - classification
  • categorization
  • browsing
  • Wellformedness
  • Bricks and mortar, i.e., profit

6
Options for Access/Control
  • Keep track of the input
  • Thesaurus
  • Authority file
  • Maximize the access
  • Search engine
  • Browse list
  • Power of the word
  • McCain

7
What do we need?
  • The basics...
  • Authority file
  • People, places, things
  • Taxonomy
  • Thesaurus with authority file or document instance
  • Automatic Classification

8
Thesaurus Construction
  • Parts of a whole
  • Noun and noun phrases
  • People, places, things
  • Actions and reactions
  • Concepts and processes

9
Term Records - Thesaurus - Format
  • Main Entries
  • Top Terms - TT
  • Broader Terms - BT
  • Narrower Terms - NT
  • Scope Notes - SN
  • History - HI
  • Date Term - added/changed - DA

10
Thesaurus - Format
  • Related Terms - RT
  • See - S
  • See Also - SA
  • Use - U
  • Use For - UF
  • Wellformedness W3C
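
The term record fields on the two slides above can be pictured as one small data structure. The following is a minimal sketch only: the class name `TermRecord`, the helper `check_reciprocal`, and the sample terms are assumptions made for illustration and do not describe any particular thesaurus product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TermRecord:
    """One thesaurus term record; attribute names mirror the slide's codes."""
    term: str                                      # main entry
    tt: List[str] = field(default_factory=list)    # Top Terms
    bt: List[str] = field(default_factory=list)    # Broader Terms
    nt: List[str] = field(default_factory=list)    # Narrower Terms
    rt: List[str] = field(default_factory=list)    # Related Terms
    uf: List[str] = field(default_factory=list)    # Use For (non-preferred forms)
    use: Optional[str] = None                      # Use (pointer to the preferred term)
    sn: str = ""                                   # Scope Note
    hi: str = ""                                   # History
    da: str = ""                                   # Date term added/changed


def check_reciprocal(records: Dict[str, TermRecord]) -> List[str]:
    """Report every BT link whose broader term does not list the term as NT."""
    problems = []
    for rec in records.values():
        for broader in rec.bt:
            parent = records.get(broader)
            if parent is None or rec.term not in parent.nt:
                problems.append(f"{rec.term} BT {broader} has no matching NT")
    return problems


# Hypothetical two-term thesaurus with one missing reciprocal link.
records = {
    "Cats": TermRecord("Cats", bt=["Mammals"]),
    "Mammals": TermRecord("Mammals"),
}
print(check_reciprocal(records))
```

A check like this is one way to keep broader/narrower links consistent as terms are added or changed.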

11
What are the parts?
  • Natural Language Processing
  • Term forms
  • Term Relationships
  • Term Associations

12
Natural Language Processing
  • Morphological
  • Lexical Analysis
  • Syntactic
  • Numerical
  • Phraseological
  • Semantic Analysis
  • Pragmatic

13
Seven Major Parts of NLP
  • 1. Morphological
  • plural
  • past tense to present

14
Seven Major Parts of NLP
  • 2. Lexical Analysis
  • part of speech tagging
  • 3. Syntactic analysis
  • noun phrase identification
  • proper name boundary

15
Seven Major Parts of NLP
  • 4. Numeric concept boundary
  • 5. Semantic analysis
  • Proper name concept categorization
  • Numeric concept categorization
  • Semantic relation extraction
  • 6. Phraseological - discourse analysis
  • Text structure identification

16
Seven Major Parts of NLP
  • 7. Pragmatic analysis
  • Cause and effect relationships
  • Nurse and nursing
  • Common sense reasoning (buy → possess)
  • Who has x?
  • These are the people who brought you.....

17
Say it another way
  • Term standardization
  • Term forms
  • Term relationships
  • Term associations
  • Rule building / domain creation

18
Word Standardization
  • Split out chemical drug terms
  • Separates chemical drug terms for special treatment
  • Split out homonyms, non-English terms, and authority terms
  • Separates objects, proper names, place names, and dates for special treatment
  • Run spelling standardization program
  • Identifies variant spellings

19
Word Standardization
  • Run word standardization program
  • -ie, -ing, -ed, -s, -es, pre-, non-, and -
  • Match preferred terms and synonyms
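
The two word standardization steps above (strip common endings, then match preferred terms and synonyms) can be sketched as follows. This is an illustration under stated assumptions, not the program the slides refer to: the suffix rules cover only some of the listed endings, prefixes are left alone, and the `PREFERRED` table is a hypothetical vocabulary.

```python
# Suffix rules tried in order: (ending to remove, replacement).
RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]

# Hypothetical lookup from known word forms to the preferred term.
PREFERRED = {
    "taxonomy": "taxonomy",
    "index": "index",
    "automobile": "car",   # synonym mapped to its preferred term
    "car": "car",
}


def candidates(word):
    """Yield the lowercased word, then each form produced by one suffix rule."""
    w = word.lower()
    yield w
    for suffix, replacement in RULES:
        # Keep at least three letters so short words are not destroyed.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            yield w[: -len(suffix)] + replacement


def standardize(word):
    """Return the preferred term for a word, or None if nothing matches."""
    for cand in candidates(word):
        if cand in PREFERRED:
            return PREFERRED[cand]
    return None


for w in ["Taxonomies", "Indexing", "Automobiles", "zebra"]:
    print(w, "->", standardize(w))
```

Checking each stripped form against the vocabulary, rather than trusting the first suffix rule that fires, avoids producing non-words such as "automobil".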

20
Term Forms
  • Noun
  • Adjective
  • Verb, adverb
  • Singular, plural
  • Initial articles
  • Spelling variants

21
Term Forms
  • Punctuation
  • Capitalization
  • Abbreviations

22
Term Relationships
  • Generic
  • Hierarchical
  • Systematic
  • Alphabetic
  • Instance
  • Poly-hierarchical

23
Term Associations
  • Cross references
  • All and some rule
  • Associative terms
  • Related terms

24
Rule building process
  • Put terms in context
  • Group like categories
  • Consider relationships
  • Standardize variants
  • Meld to a single concept rule
  • How much is really automatic???
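
One way to picture a "single concept rule" is a trigger word plus context words that together select one thesaurus term. The sketch below is hypothetical throughout: the `Rule` layout, `apply_rules`, and the sample rules are invented for illustration and are not the rule syntax of any system named in this presentation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rule:
    trigger: str                   # word or phrase looked for in the text
    concept: str                   # preferred term assigned when the rule fires
    context: Tuple[str, ...] = ()  # if given, at least one must also appear


def apply_rules(text: str, rules: List[Rule]) -> List[str]:
    """Return the concepts whose rules fire on the text, in rule order."""
    lowered = text.lower()
    assigned = []
    for rule in rules:
        if rule.trigger not in lowered:
            continue
        if rule.context and not any(c in lowered for c in rule.context):
            continue
        if rule.concept not in assigned:
            assigned.append(rule.concept)
    return assigned


# Hypothetical rules that use context to separate two senses of one word.
rules = [
    Rule("mercury", "Mercury (planet)", ("planet", "orbit")),
    Rule("mercury", "Mercury (element)", ("toxic", "metal")),
]
print(apply_rules("Mercury is a toxic heavy metal.", rules))
```

Writing and maintaining the context lists is editorial work, which is one reason to ask how much of the process is really automatic.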

25
Domains
  • Taxonomy
  • Term Record - thesaurus
  • Hierarchical Browse-able list
  • Handout in Booth 150

26
What else can we have?
  • Proximity
  • Stemming (lemmatization)
  • Truncation
  • Statistical clustering
  • Bayesian and others
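
Two of the search features above, truncation and proximity, are simple enough to sketch directly. The function names, the `*` wildcard convention, and the sample sentence are assumptions for illustration; real search engines each define their own operators.

```python
import re


def tokenize(text):
    """Split text into alphabetic word tokens."""
    return re.findall(r"[A-Za-z]+", text)


def truncation_match(pattern, tokens):
    """Right truncation: 'classif*' matches every token starting with 'classif'."""
    stem = pattern.rstrip("*").lower()
    return [t for t in tokens if t.lower().startswith(stem)]


def within_proximity(tokens, a, b, max_gap):
    """True if words a and b occur within max_gap word positions of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t.lower() == a]
    pos_b = [i for i, t in enumerate(tokens) if t.lower() == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


tokens = tokenize(
    "Automatic classification depends on a well built thesaurus "
    "and classified examples"
)
print(truncation_match("classif*", tokens))
print(within_proximity(tokens, "automatic", "classification", 1))
```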

27
Other terms and tools
  • Neural networks
  • Word normalization
  • Lexical (word) networks
  • Distance mapping
  • Pattern recognition

28
Moving toward the search engines
  • Term weighting
  • Frequency counts
  • Relevance
  • Precision
  • Recall
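
Precision and recall have standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The sketch below computes both, along with a simple relative-frequency term weight; the function names and the sample document IDs are assumptions for illustration.

```python
from collections import Counter


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def term_weights(tokens):
    """Weight each term by its share of all tokens (relative frequency)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


# Hypothetical search: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(f"precision={p:.2f} recall={r:.2f}")
```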

29
Classification of Automatic Classification Systems
  • Evolving model
  • Noun Extractors
  • Rule Based Systems
  • Semantic Processors
  • Fuzzy Search Systems
  • Filtering Systems

30
(Semi) Automatic Indexing
  • Basic theories
  • Thesaurus construction
  • Natural language processing
  • Domain specific

31
Noun extractors
  • Noun Extractors
  • Use stop word list and frequency counts
  • Semio
  • Word Perfect 5.0
  • Recon
  • Prebuilt domains
  • Autonomy
  • Net Owl
  • Newsindexer
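
The stop word list and frequency count approach mentioned above can be sketched in a few lines. This is a deliberately naive illustration, not how any listed product works: the stop word list is a small hypothetical sample, and nothing here actually checks that a word is a noun, which a real extractor would need part-of-speech information to do.

```python
import re
from collections import Counter

# Small hypothetical stop word list; production lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "are"}


def candidate_terms(text, top_n=3):
    """Return the top_n most frequent non-stop words with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


sample = (
    "The thesaurus is the basis of the taxonomy. A thesaurus records terms, "
    "and a taxonomy arranges terms. The thesaurus grows."
)
print(candidate_terms(sample))
```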

32
Rules Based Systems
  • Rule Based
  • Data Harmony
  • API
  • DTIC
  • Mapit

33
Semantic Processors
  • Synth Bank
  • n-Stein - expected
  • Quiver - beta

34
Fuzzy Search Systems
  • Dr. Link
  • Sovereign Hill

35
Filtering Systems
  • Screaming Media
  • Data Harmony

36
New Directions
  • Topic Maps - TAO
  • Topic
  • Associations
  • Occurrences
  • Relational Indexing
  • Index Visualization
  • Based on term records
  • Add the search engines.
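
The Topic, Association, Occurrence (TAO) triad above maps naturally onto three small record types. The sketch below is an assumption-laden illustration, not an implementation of the Topic Maps standard: the class names, methods, and document IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Topic:
    name: str
    occurrences: List[str] = field(default_factory=list)  # where the topic appears


@dataclass
class Association:
    kind: str                  # label for the relationship
    members: Tuple[str, str]   # names of the two topics it links


@dataclass
class TopicMap:
    topics: Dict[str, Topic] = field(default_factory=dict)
    associations: List[Association] = field(default_factory=list)

    def add_topic(self, name, *occurrences):
        self.topics[name] = Topic(name, list(occurrences))

    def associate(self, kind, a, b):
        self.associations.append(Association(kind, (a, b)))

    def related(self, name):
        """Return (kind, other topic) for every association involving name."""
        out = []
        for assoc in self.associations:
            if name in assoc.members:
                first, second = assoc.members
                out.append((assoc.kind, second if first == name else first))
        return out


tm = TopicMap()
tm.add_topic("Thesaurus", "doc-12", "doc-40")
tm.add_topic("Taxonomy", "doc-12")
tm.associate("used to build", "Thesaurus", "Taxonomy")
print(tm.related("Taxonomy"))
```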

37
What's a user to do?
  • Enjoy the presentation
  • What about a database producer?
  • Look at the options
  • Build from the basics
  • Evaluate the new tools
  • See it work before you buy

38
  • Give me your card and I will email the presentation tonight

39
Thank You
  • Marjorie M.K. Hlava
  • President, Access Innovations, Inc.
  • www.accessinn.com
  • Chairman, Data Harmony
  • mhlava_at_accessinn.com
  • 505-998-0800
  • Booth 150