Title: Ontology Learning
1Ontology Learning
in Integrated Logic Systems
Thomas Wächter, Bioinformatics Group, Biotec, TU Dresden
- Paul Buitelaar, Philipp Cimiano, Marko Grobelnik, Michael Sintek: Ontology Learning from Text. Tutorial at ECML/PKDD, Oct. 2005, Porto, Portugal.
- Steffen Staab: Ontology Learning. Inaugural workshop of the Language, Interaction and Computation Lab, May 29, 2007, Rovereto, Italy.
2Outline
- What is an Ontology?
- Why Develop an Ontology?
- Ontology Design Principles
- Ontology Learning
3What is an ontology?
4Aristotle - Ontology
- Before: study of the nature of being
- Since Aristotle: study of knowledge representation and reasoning
- Terminology
- Genus (classes)
- Species (subclasses)
- Differentiae (characteristics that allow objects to be grouped or distinguished from each other)
- Syllogisms (inference rules)
- Aristotle: Science of Being, Metaphysics, IV, 1
5What is an Ontology
- An ontology is a
- formal specification -> executable, discussable
- of a shared -> group of persons
- conceptualization -> about concepts (abstract classes)
- of a domain of interest -> e.g. an application, a specific area, the world model
Gruber 1993 - T.R. Gruber, Toward Principles for the Design of Ontologies Used for Knowledge Sharing, Formal Ontology in Conceptual Analysis and Knowledge Representation, Kluwer, 1993.
6Concept - Instance
- Concept / Class / Universal (Metaphysics)
- an abstract or general idea inferred or derived from specific instances
- Instance / Individual / Particular (Metaphysics)
- an object in reality, a copy of an abstract concept with actual values for its properties
Concept: Person
Instance: Person
Name: Thomas Wächter
Studied: Computer Science
LivesIn: Dresden
WorksAt: Biotec, TU-Dresden
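The distinction between a concept and an instance can be illustrated with a minimal Python sketch. The class below plays the role of the concept Person, and the object created from it is one instance with actual property values. The attribute names are adapted from the slide; using a dataclass is only an assumption made for illustration, not a real ontology language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """The concept (class): an abstract description shared by all persons."""
    name: str
    studied: str
    lives_in: str
    works_at: str


# An instance (individual): the concept with actual values for its properties.
thomas = Person(
    name="Thomas Wächter",
    studied="Computer Science",
    lives_in="Dresden",
    works_at="Biotec, TU-Dresden",
)
```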
7Types of ontologies
Guarino et al. 1999 - N. Guarino, C. Masolo, G. Vetere. OntoSeek: Content-Based Access to the Web. In IEEE Intelligent Systems, 14(3), 70-80, 1999.
8Example Task Ontology
9Existing Ontologies
- General purpose ontologies
- WordNet, http://www.cogsci.princeton.
- semantic lexicon for the English language
- EuroWordNet
- multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian)
- GermaNet
- semantic lexicon for the German language
- Upper level ontologies
- Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE)
- Upper-Cyc Ontology, http://www.cyc.com/ (300,000 concepts, 3,000,000 facts and rules based on 26,000 relations)
- IEEE Standard Upper Ontology, http://suo.ieee.org
- Domain and application-specific ontologies
- RDF Site Summary (RSS), http://groups.yahoo.com/group/rss-dev/files/schema.rdf
- Unified Medical Language System (UMLS), http://www.nlm.nih.gov/research/
- Gene Ontology, http://www.geneontology.org
- Dublin Core, http://dublincore.org/
10Example DUBLIN CORE
- Used to describe digital materials such as video,
sound, text or images
Title
Language
Publisher
Description
Identifier
Rights
Contributor
Relation
Coverage
Subject
Source
Creator
Format
Date
Type
11Example Unified Medical Language System (UMLS)
- Goal: understand the meaning of the language of biomedicine and health
- Used to describe
- Patient records
- Scientific literature
- Guidelines
- Public health data
- Maintained by the U.S. National Library of Medicine (NLM)
12Taxonomy
- Taxonomy: segmentation, classification and ordering of elements into a classification system according to their relationships
13Thesaurus
- Terminology for a specific domain
- Taxonomy plus fixed relationships (similar, synonym, related to)
- originates from bibliography
14Topic Map
- Topics (nodes), relationships, and occurrences of documents
- ISO standard
- typically used for navigation and visualisation
15Ontology (in our sense)
- Representation languages: RDF(S), OWL, Predicate Logic, F-Logic
16Ontologies and their relatives
17Ontologies and their relatives (2)
18Why Develop an Ontology?
19Why Develop an Ontology?
- To make domain assumptions explicit
- Easier to change domain assumptions
- Easier to understand and update legacy data
- liver as a term of human anatomy
- keyboard as a term for computer equipment
- Enables data integration at different granularities
- To separate domain knowledge from operational knowledge
- Re-use domain and operational knowledge separately
- domain: liver is_a organ
- domain: caspase 8 is_a protein
- domain: caspase 8 involved_in programmed cell death
- operational: proteins have sequence data
- operational: protein sequences are input for EBI-BLAST
20Why Develop an Ontology? (2)
- A community reference for applications
- To share a consistent understanding of what
information means
Author name: Thomas Wächter, TU-Dresden, Biotec
Protein: EntrezGene -> CASP8
21CASP8 listed at the ENTREZ (NCBI)
Species
Biological process
26Design Principles for Ontologies
27Principle
- Ontologies should be in agreement with the truths of basic science
- caspase-8 is involved in apoptosis
- menopause is part of death
- Example from the Gene Ontology 2005
28Principle
- Use Aristotelian definitions
- An A is a B which Cs
- A researcher is a person who systematically investigates and studies materials and sources to establish facts and reach conclusions. (www.spaceforspecies.ca/glossary/r.htm)
- A PhD student is a person who spends time on loads of things and finally hands in a thesis.
29Principle
- Avoid cyclic definitions.
- Do not define something in terms of itself!
- hemolysis, def.: the causes of hemolysis
- Complicates reasoning tasks
- A consistency checker will detect the inconsistency, but it remains open how to resolve it.
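How a checker might detect such a cyclic definition can be sketched as a depth-first search over a "term is defined using term" graph. This is only an illustrative sketch: the dictionary format and the function name `find_cycle` are assumptions, not part of any particular ontology tool.

```python
def find_cycle(definitions):
    """Return one cycle as a list of terms, or None if there is none.

    `definitions` maps each term to the set of terms used in its definition
    (a hypothetical, simplified representation).
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(term, path):
        state[term] = IN_PROGRESS
        path.append(term)
        for used in sorted(definitions.get(term, ())):
            if state.get(used) == IN_PROGRESS:
                # `used` is already on the current path: a cycle is closed.
                return path[path.index(used):] + [used]
            if used not in state:
                cycle = visit(used, path)
                if cycle:
                    return cycle
        path.pop()
        state[term] = DONE
        return None

    for term in sorted(definitions):
        if term not in state:
            cycle = visit(term, [])
            if cycle:
                return cycle
    return None
```

Called on `{"hemolysis": {"hemolysis"}}`, the search reports the self-reference. As the slide notes, detecting the cycle does not say how to repair the definition.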
30Principle
- Ontologies are made to capture domain knowledge
- e.g. anatomy (FMA), chemical compounds (ChEBI), etc.
OPEN BIOMEDICAL ONTOLOGIES (http://obo.sourceforge.net): Arabidopsis development, C. elegans development, Drosophila development, Mammalian phenotype, Mouse pathology, UniProt taxonomy, Human disease, ...
31Principle
- Avoid modelling negative knowledge
32Principle
- Ontology terms should be used in literature (in written language)
- Otherwise they are hardly reusable!
33Terms from the International Classification of Diseases (ICD-10)
- Accident to powered aircraft, other and unspecified, injuring occupant of military aircraft, any rank
- Other accidental submersion or drowning in water transport accident injuring occupant of other watercraft crew
- Other accidental submersion or drowning in water transport accident injuring other specified person
34Terms from the International Classification of Diseases (ICD-10)
- Fall on stairs or ladders in water transport injuring occupant of small boat, unpowered
- Nontraffic accident involving motor-driven snow vehicle injuring pedestrian
- Railway accident involving collision with rolling stock and injuring pedal cyclist
35The meaning of a word is defined by its use in
language
- Ludwig Wittgenstein
- Philosophical Investigations (1953)
- Die Bedeutung eines Wortes ist sein Gebrauch in
der Sprache
36Principle
- Ontologies should be constructed
semi-automatically for and from a given corpus
37As knowledge changes ...
From www.globalenvision.org
From homepage.mac.com/kvmagruder/flatEarth/flatEarth.html
38... ontologies need to change
capturing scientific advances
Ontology Learning
Discovering the world model
39Discovering the world model
40Ontology Learning Layer Cake
41Term Extraction
- Determine the most relevant phrases as terms
- Linguistic Methods
- Rules over linguistically analyzed text
- Linguistic analysis: Part-of-Speech Tagging, Morphological Analysis, Phrase Recognition
- Extract patterns: Adj-Noun, Noun-Noun, Adj-Noun-Noun -> AdjNoun
- Statistical Methods
- Co-occurrence (collocation) analysis for term extraction within the corpus
- Comparison of frequencies between domain and general corpora
- Lipoprotein lipase (LPL) will be very specific in dietary research
- tooth paste will be less specific
- Hybrid Methods
- Linguistic rules
- Statistical (pre- or post-) filtering
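A hybrid approach of this kind can be sketched in a few lines of Python: a linguistic rule selects candidate phrases by their part-of-speech pattern, and a statistical filter compares how often a candidate occurs in the domain corpus versus a general corpus. The tagged sentence, the coarse tag names and the add-one smoothing are assumptions made for illustration; a real system would take the tags from a POS tagger.

```python
from collections import Counter

# Hypothetical, hand-tagged sentence: (token, coarse part-of-speech tag).
TAGGED = [
    ("lipoprotein", "NOUN"), ("lipase", "NOUN"), ("regulates", "VERB"),
    ("plasma", "NOUN"), ("triglyceride", "NOUN"), ("in", "ADP"),
    ("healthy", "ADJ"), ("adults", "NOUN"),
]

# Tag patterns named on the slide: Adj-Noun-Noun, Adj-Noun, Noun-Noun.
PATTERNS = [("ADJ", "NOUN", "NOUN"), ("ADJ", "NOUN"), ("NOUN", "NOUN")]


def candidate_terms(tagged, patterns=PATTERNS):
    """Return every token window whose tag sequence equals one of the patterns."""
    found = []
    for start in range(len(tagged)):
        for pattern in patterns:
            window = tagged[start:start + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                found.append(" ".join(token for token, _ in window))
    return found


def specificity(term, domain_counts, general_counts):
    """Relative frequency in the domain corpus divided by the relative
    frequency in the general corpus (add-one smoothed to avoid division by zero)."""
    domain_rel = (domain_counts[term] + 1) / (sum(domain_counts.values()) + 1)
    general_rel = (general_counts[term] + 1) / (sum(general_counts.values()) + 1)
    return domain_rel / general_rel


# Made-up counts, only to show the direction of the effect.
DOMAIN_COUNTS = Counter({"lipoprotein lipase": 9, "tooth paste": 1})
GENERAL_COUNTS = Counter({"lipoprotein lipase": 1, "tooth paste": 9})
```

With these made-up counts, "lipoprotein lipase" scores well above 1 and "tooth paste" well below 1, which matches the intuition on the slide.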
42Linguistic Methods
43Statistical Methods
44TFIDF
47Ontology Learning Layer Cake
48Multilingual Synonyms
- Find terms that share (some) semantics, i.e. potentially refer to the same concept
- Synonyms (within languages)
- Find term pairs with similar meanings
- doctor, medic, physician
- person, patient, subject (domain: clinical trials)
- Translations (between languages)
- Dictionaries
- doctor, Arzt, médico
- Bilingual (parallel) corpora, comparable corpora, e.g. documents using vocabulary in a similar context
- Newspapers and magazines are often available in different languages
49Ontology Learning Layer Cake
50Concepts Intension, Extension, Lexicon
- A term may indicate a concept, if we can define its
- Intension
- (in)formal definition of the set of objects that this concept describes
- a disease is an impairment of health or a condition of abnormal functioning
- Extension
- a set of objects (instances) that the definition of this concept describes
- influenza, cancer, cardiovascular disease
- Extraction of instances from text is referred to as Ontology Population
- Relates to Knowledge Mark-up, Tagging, Semantic Metadata
- Uses Named-Entity Recognition and Information Extraction techniques
- Lexical Realizations
- the term itself and its multilingual synonyms
- disease, illness, Krankheit, enfermedad
51Concepts Discussion
- Instances can be e.g.
- Names for objects
- Names connected with events
52Ontology Learning Layer Cake
53Taxonomy Extraction
- Lexico-syntactic patterns
- Distributional Similarity Clustering
- Linguistic Approaches
- Document-subsumption
- Taxonomy Extension/Refinement
- Combination Opportunities
54Hearst Patterns (by M. Hearst 1992)
- Examples for hyponymy patterns
- Vehicles such as cars, trucks and bikes: vehicle -> car, truck, bike
- Such fruits as oranges, nectarines or apples: fruit -> orange, nectarine, apple
- Swimming, running and other activities: activity -> swimming, running
- Publications, especially papers and books: publication -> paper, book
- A seabass is a fish: fish -> seabass
55Hearst Patterns (by M. Hearst 1992)
- Examples for hyponymy patterns
- NP such as NP, NP, ... and NP
- Such NP as NP, NP, ... or NP
- NP, NP, ... and other NP
- NP, especially NP, NP, ... and NP
- NP is a NP.
- ...
- Principle idea: match these patterns in texts to retrieve is_a relations
- Pro: simple
- Con: precision evaluated on WordNet: 55.46% (66/119)
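The pattern-matching idea can be sketched with regular expressions. This is a deliberately simplified illustration, not Hearst's original implementation: every NP is assumed to be a single word, no noun-phrase chunking or lemmatization is done (plural forms are kept as found), and the names `HEARST_PATTERNS` and `extract_is_a` are made up for this sketch.

```python
import re

# Single-word stand-in for a noun phrase (NP). "and", "or" and "other" are
# excluded so that list connectives are never mistaken for list items.
_NP = r"(?!(?:and|or|other)\b)[A-Za-z][A-Za-z-]*"
_COMMA_LIST = rf"{_NP}(?:\s*,\s*{_NP})*"
_LIST = rf"{_COMMA_LIST}(?:\s*,?\s*(?:and|or)\s+{_NP})?"

HEARST_PATTERNS = [
    # NP such as NP, NP, ... and NP
    re.compile(rf"(?P<hyper>{_NP})\s+such\s+as\s+(?P<hypos>{_LIST})", re.I),
    # Such NP as NP, NP, ... or NP
    re.compile(rf"such\s+(?P<hyper>{_NP})\s+as\s+(?P<hypos>{_LIST})", re.I),
    # NP, NP, ... and other NP
    re.compile(rf"(?P<hypos>{_COMMA_LIST})\s*,?\s*(?:and|or)\s+other\s+(?P<hyper>{_NP})", re.I),
    # NP, especially NP, NP, ... and NP
    re.compile(rf"(?P<hyper>{_NP})\s*,\s*especially\s+(?P<hypos>{_LIST})", re.I),
    # NP is a NP
    re.compile(rf"(?:an?\s+)?(?P<hypos>{_NP})\s+is\s+an?\s+(?P<hyper>{_NP})", re.I),
]

# Separators inside a hyponym list: ",", ", and", "and", "or".
_SPLIT = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+", re.I)


def extract_is_a(text):
    """Return (hyponym, hypernym) pairs, lower-cased, in pattern order."""
    pairs = []
    for pattern in HEARST_PATTERNS:
        for match in pattern.finditer(text):
            hypernym = match.group("hyper").lower()
            for hyponym in _SPLIT.split(match.group("hypos")):
                pairs.append((hyponym.lower(), hypernym))
    return pairs
```

For example, `extract_is_a("A seabass is a fish.")` yields the single pair `("seabass", "fish")`. The limited precision quoted above is a reminder that surface patterns like these also match sentences that do not express an is_a relation.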
56Taxonomy Extraction
- Lexico-syntactic patterns
- Distributional Similarity Clustering
- Linguistic Approaches
- Document-subsumption
- Taxonomy Extension/Refinement
- Combination Opportunities
57Clustering Concept Hierarchies from Text
- Similarity-based
- Based on binary similarity measures
- Set-theoretical and Probabilistic
- Based on the probability of a term occurring together with others in a set
- Soft clustering
- No strict assignment to a cluster. Applicable in the case of ambiguous terms, e.g. bank is a financial institution or a natural object
58Clustering Concept Hierarchies from Text
- Similarity-based clustering
- Similarity Measures
- Binary: Jaccard coefficient, the size of the intersection divided by the size of the union (the Jaccard distance is 1 minus this value)
- Geometric (Cosine, Euclidean / Manhattan distance)
- Information-theoretic
- Relative Entropy: a measure of the difference between two probability distributions
- Mutual Information: the quantity that measures the mutual dependence of two variables, e.g. can one assume that x and y occur together, based on the probability of x and y co-occurring
- low value for coffee, shop
- higher value for driver, bus
- (...)
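Three of these measures can be written down in a few lines of Python. This is a sketch under stated assumptions: terms are represented either as sets of context words (Jaccard) or as dictionaries of context weights (cosine), and the information-theoretic example is pointwise mutual information computed from co-occurrence counts, which is one common way to score a single term pair and is not necessarily the exact variant meant on the slide.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B| for two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information log2(P(x,y) / (P(x) * P(y))) from counts.

    n_xy: contexts containing both x and y; n_x, n_y: contexts containing x
    or y respectively; n_total: number of contexts overall.
    """
    if n_xy == 0:
        return float("-inf")
    return math.log2((n_xy * n_total) / (n_x * n_y))
```

A PMI of 0 means x and y co-occur exactly as often as independence would predict; positive values indicate that they occur together more often than chance.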
59Clustering Concept Hierarchies from Text
- Linkage Strategies
- Complete Linkage -> D(r,s) = max d(i,j), where object i is in cluster r and object j is in cluster s
- Average Linkage -> D(r,s) = sum of all pairwise distances between cluster r and cluster s, divided by the number of pairs
- Single Linkage -> D(r,s) = min d(i,j), where object i is in cluster r and object j is in cluster s
- Methods
- Agglomerative (Bottom-Up)
- Divisive (Top-Down)
From http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
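The three linkage strategies and the agglomerative (bottom-up) method can be sketched as follows. This is a naive illustration with no attention to efficiency; the function names and the choice to stop at a fixed number of clusters are assumptions made for the example.

```python
from itertools import combinations


def linkage_distance(r, s, d, strategy="single"):
    """Distance D(r, s) between clusters r and s under a linkage strategy.

    d is a function returning the distance between two objects.
    """
    pair_distances = [d(i, j) for i in r for j in s]
    if strategy == "single":
        return min(pair_distances)
    if strategy == "complete":
        return max(pair_distances)
    if strategy == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage strategy: {strategy}")


def agglomerate(objects, d, n_clusters, strategy="single"):
    """Bottom-up clustering: start with one cluster per object and repeatedly
    merge the two closest clusters until n_clusters remain."""
    clusters = [[obj] for obj in objects]
    while len(clusters) > max(n_clusters, 1):
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage_distance(clusters[ab[0]], clusters[ab[1]], d, strategy),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


def absolute_difference(x, y):
    """Toy distance for one-dimensional example data."""
    return abs(x - y)
```

Recording the order of merges instead of stopping at `n_clusters` would give the full hierarchy (dendrogram); a divisive method would instead start from one cluster and split it.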
60Bi-Section K-Means Clustering
Cluster labels in the figure: ???, ???, trip?, vehicle?
- Task: find labels for clusters
- Finding labels for snippets of search results (check www.carrot2.org)
- Labelling clusters with hypernyms found with Hearst patterns
- Which clusters correspond to a concept?
61Taxonomy Extraction
- Lexico-syntactic patterns
- Distributional Similarity Clustering
- Linguistic Approaches
- Document-subsumption
- Taxonomy Extension/Refinement
- Combination Opportunities
62Subsumption of terms
- A term t1 subsumes a term t2, i.e. is-a(t2,t1), if t1 appears in all the documents in which t2 appears
- Probabilistic definition: is-a(y,x) iff P(x|y) > threshold, where P(x|y) = n(x,y) / n(y)
- n(x,y): number of documents in which x and y co-occur
- n(y): number of documents which contain y
63Taxonomy Extraction
- Lexico-syntactic patterns
- Distributional Similarity Clustering
- Linguistic Approaches
- Document-subsumption
- Taxonomy Extension/Refinement
- Combination Opportunities
64Taxonomy Extension/Refinement
65Taxonomy Extraction
- Lexico-syntactic patterns
- Distributional Similarity Clustering
- Linguistic Approaches
- Document-subsumption
- Taxonomy Extension/Refinement
- Combination Opportunities
66Ontology Learning Layer Cake
67Ontology Learning Layer Cake
68Pipeline for discovering terminology
Example topic: lipoprotein metabolism
Pipeline steps (as labelled in the figure):
- Tokenization
- Syntactic Filtering
- POS Tagging
- Syntactic Grouping
- Noun Phrase
- Find nested concepts
- Abbreviations
- Local Frequency
- Sentences
- Global Frequency PubMed
- Extract Concepts
- Global Frequency Google
- TFIDF Google/PubMed
- Sort Concepts by TFIDF
Resulting ranked terms:
- 1 lipoprotein
- 2 cholesterol
- 3 lipid
- 4 metabolism
- 5 insulin
- 6 CETP
- 7 lipase
- 8 LDL
- 9 high-density
- 10 glucose
- 11 HDL
- 12 triglyceride
- 13 plasma
- 14 risk
- 15 low-density
- ...
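The last two pipeline steps, computing TFIDF from a local and a global frequency and sorting by it, might look like the following sketch. The exact weighting used in the original pipeline is not given on the slide, so this uses one common TF-IDF variant, and all counts are made-up numbers for illustration only.

```python
import math


def tfidf(local_frequency, global_document_frequency, total_documents):
    """One common TF-IDF variant: term frequency in the local text times the
    log of the inverse share of documents in the global corpus that contain
    the term (add-one in the denominator avoids division by zero)."""
    return local_frequency * math.log(total_documents / (1 + global_document_frequency))


def rank_terms(local_counts, global_counts, total_documents):
    """Sort terms by descending TF-IDF score."""
    return sorted(
        local_counts,
        key=lambda term: tfidf(local_counts[term], global_counts.get(term, 0), total_documents),
        reverse=True,
    )


# Hypothetical counts: local = retrieved abstracts, global = a large reference corpus.
LOCAL_COUNTS = {"lipoprotein": 40, "risk": 35, "plasma": 20}
GLOBAL_COUNTS = {"lipoprotein": 1_000, "risk": 500_000, "plasma": 50_000}
TOTAL_DOCUMENTS = 1_000_000
```

With these numbers "risk" is frequent locally but also very common globally, so it ends up below the more domain-specific "lipoprotein" and "plasma".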
69References
- Paul Buitelaar, Philipp Cimiano, Marko Grobelnik, Michael Sintek: Ontology Learning from Text. Tutorial at ECML/PKDD, Oct. 2005, Porto, Portugal.
- Steffen Staab: Ontology Learning. Inaugural workshop of the Language, Interaction and Computation Lab, May 29, 2007, Rovereto, Italy.