Ontology Learning

1
Ontology Learning
in Integrated Logic Systems
Thomas Wächter, Bioinformatics Group, Biotec, TU Dresden
  • Paul Buitelaar, Philipp Cimiano, Marko Grobelnik,
    Michael Sintek: Ontology Learning from Text.
    Tutorial at ECML/PKDD, Oct. 2005, Porto, Portugal.
  • Steffen Staab: Ontology Learning. Inaugural
    workshop of the Language, Interaction and
    Computation Lab, May 29, 2007, Rovereto, Italy.

2
Outline
  • What is an Ontology?
  • Why Develop an Ontology?
  • Ontology Design Principles
  • Ontology Learning

3
What is an ontology?
4
Aristotle - Ontology
  • Before: study of the nature of being
  • Since Aristotle: study of knowledge
    representation and reasoning
  • Terminology:
    • Genus (classes)
    • Species (subclasses)
    • Differentiae (characteristics that allow objects
      to be grouped or distinguished from each other)
    • Syllogisms (inference rules)
  • Aristotle: Science of Being, Metaphysics, IV, 1

5
What is an Ontology
  • An ontology is a
    • formal specification → executable, discussable
    • of a shared → group of persons
    • conceptualization → about concepts (abstract
      classes)
    • of a domain of interest → e.g. an application,
      a specific area, the world model

Gruber 1993 - T.R. Gruber, Toward Principles
for the Design of Ontologies Used for Knowledge
Sharing, in Formal Ontology in Conceptual Analysis
and Knowledge Representation, Kluwer, 1993.
6
Concept - Instance
  • Concept / Class / Universal (Metaphysics):
    an abstract or general idea inferred or derived
    from specific instances
  • Instance / Individual / Particular (Metaphysics):
    an object in reality, a copy of an abstract concept
    with actual values for its properties

Example: the concept Person and an instance of Person
Name: Thomas Wächter
Studied: Computer Science
LivesIn: Dresden
WorksAt: Biotec, TU-Dresden
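
The concept/instance distinction above can be sketched in Python, where a class plays the role of the concept and an object the role of the instance. This is only an illustration: the class and field names below are assumptions that mirror the slide's Person example, not part of any ontology language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Concept (class): a general idea described by its properties."""
    name: str
    studied: str
    lives_in: str
    works_at: str


# Instance (individual): the concept with actual values for its properties.
thomas = Person(
    name="Thomas Wächter",
    studied="Computer Science",
    lives_in="Dresden",
    works_at="Biotec, TU-Dresden",
)

print(isinstance(thomas, Person))
print(thomas.lives_in)
```

Ontology languages such as RDF(S) or OWL draw the same line between classes and individuals, but with formal semantics that a plain Python class does not have.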
7
Types of ontologies
Guarino et al. 1999 - N. Guarino, C. Masolo, G.
Vetere. OntoSeek: Content-Based Access to the
Web. In IEEE Intelligent Systems, 14(3), 70-80,
1999.
8
Example Task Ontology
9
Existing Ontologies
  • General purpose ontologies
    • WordNet, http://www.cogsci.princeton.
      semantic lexicon for the English language
    • EuroWordNet
      multilingual database with wordnets for several
      European languages (Dutch, Italian, Spanish,
      German, French, Czech and Estonian)
    • GermaNet
      semantic lexicon for the German language
  • Upper level ontologies
    • Descriptive Ontology for Linguistic and Cognitive
      Engineering (DOLCE)
    • Upper-Cyc Ontology, http://www.cyc.com/ (300,000
      concepts, 3,000,000 facts and rules based on
      26,000 relations)
    • IEEE Standard Upper Ontology, http://suo.ieee.org
  • Domain and application-specific ontologies
    • RDF Site Summary (RSS),
      http://groups.yahoo.com/group/rss-dev/files/schema.rdf
    • Unified Medical Language System (UMLS),
      http://www.nlm.nih.gov/research/
    • Gene Ontology, http://www.geneontology.org
    • Dublin Core, http://dublincore.org/

10
Example: Dublin Core
  • Used to describe digital materials such as video,
    sound, text or images
  • Elements: Title, Language, Publisher, Description,
    Identifier, Rights, Contributor, Relation,
    Coverage, Subject, Source, Creator, Format, Date,
    Type
11
Example: Unified Medical Language System (UMLS)
  • Goal: understand the meaning of the language of
    biomedicine and health
  • Used to describe
    • Patient records
    • Scientific literature
    • Guidelines
    • Public health data
  • Maintained by the U.S. National Library of Medicine (NLM)

12
Taxonomy
  • Taxonomy: segmentation, classification and
    ordering of elements into a classification system
    according to their relationships

13
Thesaurus
  • Terminology for a specific domain
  • Taxonomy plus fixed relationships (similar,
    synonym, related to)
  • Originates from bibliography

14
Topic Map
  • Topics (nodes), relationships, and occurrences of
    documents
  • ISO-Standard
  • typically for navigation and visualisation

15
Ontology (in our sense)
  • Representation languages: RDF(S), OWL, predicate
    logic, F-Logic

16
Ontologies and their relatives
17
Ontologies and their relatives (2)
18
Why Develop an Ontology?
19
Why Develop an Ontology?
  • To make domain assumptions explicit
    • Easier to change domain assumptions
    • Easier to understand and update legacy data
    • liver as a term of human anatomy
    • keyboard as a term for computer equipment
    • Enables data integration at different
      granularities
  • To separate domain knowledge from operational
    knowledge
    • Re-use domain and operational knowledge
      separately
    • domain: liver is_a organ
    • domain: caspase 8 is_a protein
    • domain: caspase 8 involved_in programmed
      cell death
    • operational: proteins have sequence data
    • operational: protein sequences are input for
      EBI-BLAST

20
Why Develop an Ontology? (2)
  • A community reference for applications
  • To share a consistent understanding of what
    information means

Author name: Thomas Wächter, TU-Dresden, Biotec
Protein: EntrezGene → CASP8
21
CASP8 as listed at Entrez (NCBI)
[Figure: Entrez entry for CASP8, with the annotations Species and Biological process highlighted]
26
Design Principles for Ontologies
27
Principle
  • Ontologies should be in agreement with the truths
    of basic science
    • In agreement: caspase-8 is involved in apoptosis
    • Not in agreement: menopause is part of death
      (example from the Gene Ontology, 2005)

28
Principle
  • Use Aristotelian definitions
  • An A is a B which Cs
  • A researcher is a person who systematically
    investigates and studies materials and sources to
    establish facts and reach conclusions.
    (www.spaceforspecies.ca/glossary/r.htm)
  • A PhD student is a person who spends time on
    loads of things and finally hands in a thesis.

29
Principle
  • Avoid cyclic definitions.
  • Do not define something with itself!
    • hemolysis, def.: the causes of hemolysis
  • Cyclic definitions complicate reasoning tasks
  • A consistency checker will detect the
    inconsistency, but it remains open how to resolve
    it.
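
A check for cyclic definitions can be sketched as a depth-first search over a "term is defined using these terms" mapping. The dictionary layout and the function name `find_cycle` are assumptions made for this illustration; a real ontology editor or reasoner would work on its own data model.

```python
def find_cycle(definitions):
    """Return a list of terms forming a definitional cycle, or None.

    `definitions` maps each term to the terms used in its definition.
    """
    visiting, done = set(), set()

    def visit(term, path):
        if term in done:
            return None
        if term in visiting:
            # The term is already on the current path: a cycle is closed.
            return path[path.index(term):] + [term]
        visiting.add(term)
        for used in definitions.get(term, ()):
            cycle = visit(used, path + [term])
            if cycle:
                return cycle
        visiting.discard(term)
        done.add(term)
        return None

    for term in definitions:
        cycle = visit(term, [])
        if cycle:
            return cycle
    return None


# The slide's example: hemolysis is defined in terms of hemolysis.
cyclic = {"hemolysis": ["hemolysis"]}
acyclic = {"liver": ["organ"], "organ": []}
print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

The search only reports that a cycle exists and which terms are on it; as the slide notes, deciding how to repair the definitions remains a manual modelling task.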
30
Principle
  • Ontologies are made to capture domain knowledge
  • e.g. anatomy (FMA), chemical compounds (CHEBI),
    etc.

OPEN BIOMEDICAL ONTOLOGIES (http://obo.sourceforge.net):
Arabidopsis development, C. elegans
development, Drosophila development, Mammalian
phenotype, Mouse pathology, UniProt taxonomy,
Human disease,
31
Principle
  • Avoid modelling negative knowledge

32
Principle
  • Ontology terms should be used in literature (in
    written language)
  • Otherwise they are hardly reusable!!

33
Terms from the International Classification of
Diseases (ICD-10)
  • Accident to powered aircraft, other and
    unspecified, injuring occupant of military
    aircraft, any rank
  • Other accidental submersion or drowning in water
    transport accident injuring occupant of other
    watercraft crew
  • Other accidental submersion or drowning in water
    transport accident injuring other specified
    person

34
Terms from the International Classification of
Diseases (ICD-10)
  • Fall on stairs or ladders in water transport
    injuring occupant of small boat, unpowered
  • Nontraffic accident involving motor-driven snow
    vehicle injuring pedestrian
  • Railway accident involving collision with rolling
    stock and injuring pedal cyclist

35
The meaning of a word is defined by its use in
language
  • Ludwig Wittgenstein
  • Philosophical Investigations (1953)
  • Die Bedeutung eines Wortes ist sein Gebrauch in
    der Sprache

36
Principle
  • Ontologies should be constructed
    semi-automatically for and from a given corpus

37
As knowledge changes
From www.globalenvision.org
From homepage.mac.com/kvmagruder/flatEarth/flatEarth.html
38
ontologies need to change
capturing scientific advances
Ontology Learning
Discovering the world model
39
Discovering the world model
40
Ontology Learning Layer Cake
41
Term Extraction
  • Determine the most relevant phrases as terms
  • Linguistic methods
    • Rules over linguistically analyzed text
    • Linguistic analysis: part-of-speech tagging,
      morphological analysis, phrase recognition
    • Extract patterns: Adj-Noun, Noun-Noun,
      Adj-Noun-Noun → AdjNoun
  • Statistical methods
    • Co-occurrence (collocation) analysis for term
      extraction within the corpus
    • Comparison of frequencies between domain and
      general corpora
    • Lipoprotein lipase (LPL) will be very specific
      in dietary research
    • toothpaste will be less specific
  • Hybrid methods
    • Linguistic rules
    • Statistical (pre- or post-) filtering
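
Both families of methods can be sketched in a few lines of Python. The sketch assumes the text is already POS-tagged (the tag names `ADJ`, `NOUN`, `VERB` are assumptions), and the corpus counts are invented toy numbers, not measurements. The frequency ratio is one simple way to compare domain and general corpora; the slide does not prescribe a specific formula.

```python
# POS patterns named on the slide: Adj-Noun, Noun-Noun, Adj-Noun-Noun.
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN"), ("ADJ", "NOUN", "NOUN")}


def extract_candidates(tagged_tokens):
    """Linguistic step: return phrases whose tag sequence matches a pattern."""
    candidates = []
    for size in (2, 3):
        for start in range(len(tagged_tokens) - size + 1):
            window = tagged_tokens[start:start + size]
            if tuple(tag for _, tag in window) in PATTERNS:
                candidates.append(" ".join(word for word, _ in window))
    return candidates


def specificity(term, domain_counts, general_counts):
    """Statistical step: relative frequency in the domain corpus divided by
    relative frequency in the general corpus (crude add-one smoothing)."""
    p_domain = (domain_counts.get(term, 0) + 1) / (sum(domain_counts.values()) + 1)
    p_general = (general_counts.get(term, 0) + 1) / (sum(general_counts.values()) + 1)
    return p_domain / p_general


tagged = [("elevated", "ADJ"), ("lipoprotein", "NOUN"), ("lipase", "NOUN"),
          ("is", "VERB"), ("measured", "VERB")]
candidates = extract_candidates(tagged)
print(candidates)

# Toy counts (hypothetical): a dietary-research corpus vs. a general corpus.
domain = {"lipoprotein lipase": 40, "toothpaste": 2, "diet": 58}
general = {"lipoprotein lipase": 1, "toothpaste": 30, "diet": 69}
for term in ("lipoprotein lipase", "toothpaste"):
    print(term, round(specificity(term, domain, general), 2))
```

A hybrid method in the sense of the slide would chain the two: generate candidates with the linguistic patterns, then keep only those whose specificity exceeds some chosen cut-off.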

42
Linguistic Methods
43
Statistical Methods
44
TFIDF
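
The slide's content did not survive in the transcript, so the following is only a sketch of one common TF-IDF variant (term frequency normalised by document length, times the logarithm of the inverse document frequency). The function name and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [
    ["lipoprotein", "lipase", "risk"],
    ["risk", "plasma"],
    ["risk", "glucose"],
]
weights = tfidf(docs)
print(weights[0])
```

A term that occurs in every document (here "risk") gets weight zero, while a term concentrated in few documents is ranked higher.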
47
Ontology Learning Layer Cake
48
Multilingual Synonyms
  • Find terms that share (some) semantics, i.e.
    potentially refer to the same concept
  • Synonyms (within languages)
    • Find term pairs with similar meanings
    • doctor, medic, physician
    • person, patient, subject (domain: clinical
      trials)
  • Translations (between languages)
    • Dictionaries
    • doctor, Arzt, médico
    • Bilingual (parallel) corpora, comparable
      corpora, e.g. documents using vocabulary in a
      similar context
    • Newspapers and magazines are often available in
      different languages

49
Ontology Learning Layer Cake
50
Concepts: Intension, Extension, Lexicon
  • A term may indicate a concept, if we can define
    its
  • Intension
    • (in)formal definition of the set of objects that
      this concept describes
    • a disease is an impairment of health or a
      condition of abnormal functioning
  • Extension
    • a set of objects (instances) that the definition
      of this concept describes
    • influenza, cancer, cardiovascular disease
    • Extraction of instances from text is referred to
      as Ontology Population
    • Relates to knowledge mark-up, tagging and
      semantic metadata
    • Uses Named-Entity Recognition and Information
      Extraction techniques
  • Lexical Realizations
    • the term itself and its multilingual synonyms
    • disease, illness, Krankheit, enfermedad

51
Concepts Discussion
  • Instances can be e.g.
  • Names for objects
  • Names connected with events

52
Ontology Learning Layer Cake
53
Taxonomy Extraction
  • Lexico-syntactic patterns
  • Distributional Similarity Clustering
  • Linguistic Approaches
  • Document-subsumption
  • Taxonomy Extension/Refinement
  • Combination Opportunities

54
Hearst Patterns (by M. Hearst 1992)
  • Examples for hyponymy patterns
  • "Vehicles such as cars, trucks and bikes":
    vehicle → car, truck, bike
  • "Such fruits as oranges, nectarines or apples":
    fruit → orange, nectarine, apple
  • "Swimming, running and other activities":
    activity → swimming, running
  • "Publications, especially papers and books":
    publication → paper, book
  • "A seabass is a fish.": fish → seabass

55
Hearst Patterns (by M. Hearst 1992)
  • Examples for hyponymy patterns
  • NP such as NP, NP, ... and NP
  • Such NP as NP, NP, ... or NP
  • NP, NP, ... and other NP
  • NP, especially NP, NP, ... and NP
  • NP is a NP.
  • ...
  • Principal idea: match these patterns in texts to
    retrieve is_a relations
  • Pro: simple
  • Con: precision evaluated on WordNet is 55.46%
    (66/119)
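
Matching these patterns can be sketched with regular expressions. This is a deliberately naive illustration: it treats every noun phrase as a single word, covers only three of the patterns, and returns the surface forms without lemmatisation (so "cars", not "car"). The names `ITEMS`, `PATTERNS` and `hearst_pairs` are assumptions; a realistic implementation would match over POS-tagged or chunked text.

```python
import re

# A comma-separated list of single words, optionally closed by "and"/"or".
ITEMS = r"(\w+(?:, \w+)*(?:,? (?:and|or) \w+)?)"

# (compiled pattern, True if the hypernym is the first captured group)
PATTERNS = [
    (re.compile(r"(\w+) such as " + ITEMS), True),    # NP such as NP, NP and NP
    (re.compile(r"[Ss]uch (\w+) as " + ITEMS), True),  # Such NP as NP, NP or NP
    (re.compile(ITEMS + r",? and other (\w+)"), False),  # NP, NP and other NP
]


def hearst_pairs(text):
    """Return (hyponym, hypernym) pairs found by the patterns above."""
    pairs = []
    for regex, hypernym_first in PATTERNS:
        for match in regex.finditer(text):
            if hypernym_first:
                hypernym, items = match.group(1), match.group(2)
            else:
                items, hypernym = match.group(1), match.group(2)
            for hyponym in re.split(r",? (?:and|or) |, ", items):
                pairs.append((hyponym.lower(), hypernym.lower()))
    return pairs


print(hearst_pairs("Vehicles such as cars, trucks and bikes"))
print(hearst_pairs("Swimming, running and other activities"))
```

Each returned pair reads as is_a(hyponym, hypernym). The moderate precision reported on the slide is a reminder that such matches need filtering before they enter a taxonomy.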

57
Clustering Concept Hierarchies from Text
  • Similarity-based
    • Based on binary similarity measures
  • Set-theoretical and probabilistic
    • Based on the probability of a term occurring
      together with the others in a set
  • Soft clustering
    • No strict assignment to a cluster. Applicable in
      the case of ambiguous terms, e.g. bank is a
      financial institution or a natural object

58
Clustering Concept Hierarchies from Text
  • Similarity-based clustering
  • Similarity measures
    • Binary: Jaccard coefficient, the size of the
      intersection divided by the size of the union
      (the Jaccard distance is 1 minus this value)
    • Geometric (cosine, Euclidean / Manhattan
      distance)
    • Information-theoretic
      • Relative entropy: a measure of the difference
        between two probability distributions
      • Mutual information: the quantity that measures
        the mutual dependence of two variables, e.g.
        can one assume that x and y occur together,
        based on the probability of x and y
        co-occurring?
      • low value for (coffee, shop)
      • higher value for (driver, bus)
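
The measures above can be sketched as follows. The function names and the toy numbers are assumptions. For mutual information the sketch uses the pointwise variant (PMI) between two terms, estimated from document counts, which is one common way to apply the idea to term pairs.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information log2(P(x,y) / (P(x) * P(y))),
    with probabilities estimated from counts (all counts must be > 0)."""
    p_xy = n_xy / n_total
    return math.log2(p_xy / ((n_x / n_total) * (n_y / n_total)))


print(jaccard({"drive", "ride"}, {"ride", "book"}))
print(cosine([1, 2], [2, 4]))
print(pmi(n_xy=50, n_x=100, n_y=100, n_total=1000))
```

A PMI near zero means the two terms co-occur about as often as chance would predict; clearly positive values indicate that they attract each other.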

59
Clustering Concept Hierarchies from Text
  • Linkage strategies
    • Complete linkage → D(r,s) = max d(i,j), where
      object i is in cluster r and object j is in
      cluster s
    • Average linkage → D(r,s) = average of all
      pairwise distances d(i,j) between objects in
      cluster r and objects in cluster s
    • Single linkage → D(r,s) = min d(i,j), where
      object i is in cluster r and object j is in
      cluster s
  • Methods
    • Agglomerative (bottom-up)
    • Divisive (top-down)

From http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
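
The three linkage strategies differ only in how they aggregate the pairwise distances between two clusters. A minimal sketch, assuming clusters are lists of objects and `d` is any distance function (the helper names are hypothetical):

```python
from itertools import product


def pairwise_distances(r, s, d):
    """All distances d(i, j) with i in cluster r and j in cluster s."""
    return [d(i, j) for i, j in product(r, s)]


def complete_linkage(r, s, d):
    return max(pairwise_distances(r, s, d))


def single_linkage(r, s, d):
    return min(pairwise_distances(r, s, d))


def average_linkage(r, s, d):
    distances = pairwise_distances(r, s, d)
    return sum(distances) / len(distances)


def absolute_difference(i, j):
    return abs(i - j)


# Toy one-dimensional clusters.
r, s = [0, 1], [4, 6]
print(complete_linkage(r, s, absolute_difference))  # farthest pair: 0 and 6
print(single_linkage(r, s, absolute_difference))    # closest pair: 1 and 4
print(average_linkage(r, s, absolute_difference))   # mean over all four pairs
```

An agglomerative (bottom-up) procedure repeatedly merges the two clusters with the smallest D(r,s) under the chosen strategy.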
60
Bi-Section K-Means Clustering
[Figure: cluster hierarchy with unlabeled clusters ("???") and candidate labels "trip?" and "vehicle?"]
  • Task: find labels for clusters
    • Finding labels for snippets of search results
      (check www.carrot2.org)
    • Labelling clusters with hypernyms found with
      Hearst patterns
  • Which clusters correspond to a concept?

62
Subsumption of terms
  • A term t1 subsumes a term t2, i.e. is-a(t2,t1),
    if t1 appears in all the documents in which t2
    appears
  • Probabilistic definition: is-a(y,x) iff
    $P(x \mid y) = \frac{n(x,y)}{n(y)} > \text{threshold}$, where
    n(x,y) is the number of documents in which x and y
    co-occur and n(y) is the number of documents that
    contain y
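
The probabilistic definition can be sketched directly from document counts. Documents are modelled as sets of terms; the function name and the default threshold of 0.8 are assumptions made for this illustration, since the slide leaves the threshold open.

```python
def subsumes(x, y, documents, threshold=0.8):
    """Return True if is-a(y, x) holds, i.e. P(x | y) = n(x, y) / n(y) > threshold.

    `documents` is a list of sets of terms; `threshold` is an assumed value.
    """
    n_y = sum(1 for doc in documents if y in doc)
    if n_y == 0:
        return False  # y never occurs, so nothing can be concluded
    n_xy = sum(1 for doc in documents if x in doc and y in doc)
    return n_xy / n_y > threshold


docs = [
    {"fish", "seabass"},
    {"fish", "trout"},
    {"fish"},
    {"seabass", "fish", "sea"},
]
print(subsumes("fish", "seabass", docs))  # fish occurs wherever seabass occurs
print(subsumes("seabass", "fish", docs))  # seabass occurs in only half of the fish documents
```

With the threshold at 1.0 and a non-strict comparison this reduces to the strict definition in the first bullet; lowering it tolerates a few documents that mention the narrower term alone.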

64
Taxonomy Extension/Refinement
66
Ontology Learning Layer Cake
67
Ontology Learning Layer Cake
68
Pipeline for discovering terminology
Example: lipoprotein metabolism
Labels from the pipeline diagram:
  • Tokenization
  • Syntactic Filtering
  • POS Tagging
  • Syntactic Grouping
  • Noun Phrase
  • Find nested concepts
  • Abbreviations
  • Local Frequency
  • Sentences
  • Global Frequency PubMed
  • Extract Concepts
  • Global Frequency Google
  • TFIDF Google/PubMed
  • Sort Concepts by TFIDF

Ranked terms:
  • 1 lipoprotein
  • 2 cholesterol
  • 3 lipid
  • 4 metabolism
  • 5 insulin
  • 6 CETP
  • 7 lipase
  • 8 LDL
  • 9 high-density
  • 10 glucose
  • 11 HDL
  • 12 triglyceride
  • 13 plasma
  • 14 risk
  • 15 low-density
69
References
  • Paul Buitelaar, Philipp Cimiano, Marko Grobelnik,
    Michael Sintek: Ontology Learning from Text.
    Tutorial at ECML/PKDD, Oct. 2005, Porto, Portugal.
  • Steffen Staab: Ontology Learning. Inaugural
    workshop of the Language, Interaction and
    Computation Lab, May 29, 2007, Rovereto, Italy.