Title: Semantics
1. Semantics: Data Mining and Document Processing
- Nima Kaviani
- School of Interactive Arts and Technology
- Simon Fraser University, Surrey
2. Towards Semantic Web Mining [5]
- The idea is to combine two fast-developing research areas: the Semantic Web and Web Mining.
- The Semantic Web can be used to improve the results of web mining by exploiting the new semantic structures in the web.
- Web mining is useful for enhancing concepts and instances: it can learn the definition of structures for knowledge organization and populate such knowledge organizations.
3. Web Mining
- Definition: the application of data mining techniques to the content, structure, and usage of web resources.
- Web Content Mining: a form of text mining that extracts data from the content of web pages.
- Web Structure Mining: extracts information residing in the structure of hypertext (the idea behind links, also used for page ranking).
- Web Usage Mining: the web resource being mined is the record of requests made by users, used to capture user behavior.
4. Semantic Web
- Definition: adding semantic annotation to web documents so that they can be easily understood by humans and read by machines for further inference.
- Ontology Learning: semi-automatic extraction of semantics from the web to create an ontology.
- Mapping and Merging Ontologies: merging different ontologies to build a new domain-specific ontology (described by Davis).
- Instance Learning: automatic or semi-automatic methods to extract information from web-related documents, either to help in annotating new documents or to extract additional information from existing unstructured or partially structured documents.
5. Creating an Ontology
- An ontology is a conceptualization of a domain into a human-understandable but machine-readable format: a quadruple of entities, attributes, relationships, and axioms [3]. A minimal data-structure sketch follows this list.
- Steps in creating an ontology for the data [8]:
- determining the scope of the ontology
- reusing existing ontologies
- enumerating all the concepts needed
- defining the taxonomy
- defining the properties
- defining facets of the concepts
- defining instances
- These steps are normally performed by an ontology engineer.
- They can be performed semi-automatically.
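To make the quadruple above concrete, here is a minimal sketch of an ontology as a data structure in Python. The class name, field names, and the add_isa helper are illustrative assumptions, not part of any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    # The "quadruple" from the slide: entities, attributes, relationships, axioms.
    entities: set[str] = field(default_factory=set)                    # concept names
    attributes: dict[str, set[str]] = field(default_factory=dict)      # concept -> attribute names
    relations: set[tuple[str, str, str]] = field(default_factory=set)  # (subject, predicate, object)
    axioms: list[str] = field(default_factory=list)                    # free-form constraints, for illustration

    def add_isa(self, child: str, parent: str) -> None:
        """Record a taxonomic (is-a) relation between two concepts."""
        self.entities.update({child, parent})
        self.relations.add((child, "is-a", parent))

onto = Ontology()
onto.add_isa("car", "motorized vehicle")
onto.attributes["car"] = {"number_of_wheels"}
onto.axioms.append("every car has exactly four wheels")
print(onto.relations)  # {('car', 'is-a', 'motorized vehicle')}
```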
6. Ontology Learning
- Why do we try to make ontology learning automatic (or semi-automatic)?
- The source data is usually stored in unstructured, semi-structured (HTML, XML), or structured (database) formats and has to be processed before it can be used to create the ontology [3].
- Laborious and cumbersome task
- Time consuming
- Dynamic nature of the available domains
- Lack of tools and guidelines [1]
7. Semi-Automatic Ontology Learning
- It aims to integrate a multitude of disciplines in order to facilitate the construction of ontologies [12]. Because of the tacit information involved, human intervention is always required [5].
- Steps in building an ontology automatically:
- Acquisition of concepts
- Establishment of concept taxonomies
- Discovery of non-taxonomic conceptual relations
- Pruning the generated ontology
8. Acquisition of concepts and establishment of taxonomic relations
- Using IR and NLP techniques, concepts can be extracted quite efficiently.
- Techniques are normally a combination of the methods below, usually with a stronger emphasis on one of them:
- Computational linguistics
- Information retrieval
9. Term Definitions
- Recall
  - The fraction of the known relevant documents that were effectively retrieved.
- Precision
  - The fraction of the retrieved documents that are known to be relevant.
- TF-IDF
  - The tf-idf weight (term frequency - inverse document frequency) is a weighting scheme often used in information retrieval.
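As a quick illustration of the recall and precision definitions above, here is a minimal sketch computed over sets of document identifiers; the function names and the toy document sets are illustrative.

```python
def recall(relevant: set[str], retrieved: set[str]) -> float:
    """Fraction of the known relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

def precision(relevant: set[str], retrieved: set[str]) -> float:
    """Fraction of the retrieved documents that are known to be relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
print(recall(relevant, retrieved))     # 0.5      (2 of 4 relevant documents were found)
print(precision(relevant, retrieved))  # 0.666... (2 of 3 retrieved documents are relevant)
```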
10. Methods used in acquiring the concepts (1)
- Computational linguistic approach
  - Pre-processing the text to extract dependencies and single-word nouns
    - POS tagger (part-of-speech) [11]; dependency parser [2, 4]
    - Minipar (a state-of-the-art dependency parser) [1]
  - Extracting multi-word noun phrases [2]
    - Shallow-parse the text
    - Filter out word phrases with interesting POS-tag patterns
    - Decide for each phrase whether it is a noun phrase
  - Extracting taxonomic relations [14]
    - Uses regular expressions to find is-a relations (a regex sketch follows this list)
    - Defines lexico-syntactic patterns such as
      - "NP, NP, ... or other NP"
      - "Bruises, wounds, broken bones, or other injuries"
  - The overall process is a combination of the tools below [12]:
    - Tokenizer: regular expressions to find nouns
    - Lexicon: a large repository of stems
    - Lexical analyzer: merges the results of the two components above and extracts new concepts
    - Chunk parser: works on phrases to generate syntactic dependency relations; uses the POS tagger
    - Heuristics: include correlations besides the linguistics-based dependency relations
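The following is a minimal sketch of the lexico-syntactic "NP, NP, ... or other NP" pattern mentioned above for finding is-a relations [14]. The regular expression and the simple comma splitting are illustrative assumptions; a real system would apply such patterns to chunker output rather than raw strings.

```python
import re

# Matches "X, Y, Z, or other W" and treats X, Y, Z as hyponyms of W.
PATTERN = re.compile(
    r"(?P<hyponyms>(?:\w+(?:\s\w+)*,\s)+)(?:and\s|or\s)other\s(?P<hypernym>\w+(?:\s\w+)*)"
)

def extract_isa(sentence: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs matched by the 'or other' pattern."""
    relations = []
    for m in PATTERN.finditer(sentence):
        hypernym = m.group("hypernym")
        for hyponym in m.group("hyponyms").rstrip(", ").split(","):
            relations.append((hyponym.strip(), hypernym))
    return relations

print(extract_isa("bruises, wounds, broken bones, or other injuries"))
# [('bruises', 'injuries'), ('wounds', 'injuries'), ('broken bones', 'injuries')]
```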
11. Methods used in acquiring the concepts (2)
- An information retrieval method using term weightings [12]
  - Counts relevant terms and extracts the more frequent ones as concepts
  - tf(l, d): the frequency of appearance of term l in document d
  - df(l): the number of documents in the corpus D in which term l occurs
  - cf(l): the total number of occurrences of term l in the corpus D
  - These are combined into a tf-idf style weight, tfidf(l, d) = tf(l, d) * log(|D| / df(l)); a small worked example follows this list
- Methods to find the taxonomy
  - Clustering (starts from scratch and uses distributional data about words)
  - Classification (uses an available hierarchy and refines it)
  - Lexico-syntactic patterns (regular expressions)
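A small worked example of the term weights defined above, assuming a toy three-document corpus and the standard tf-idf combination tf(l, d) * log(|D| / df(l)):

```python
import math
from collections import Counter

corpus = {
    "d1": "semantic web mining combines semantic web and web mining".split(),
    "d2": "ontology learning extracts concepts from web documents".split(),
    "d3": "web usage mining records user requests".split(),
}

tf = {d: Counter(tokens) for d, tokens in corpus.items()}                  # tf(l, d)
df = Counter(term for tokens in corpus.values() for term in set(tokens))  # df(l)
cf = Counter(term for tokens in corpus.values() for term in tokens)       # cf(l)

def tfidf(term: str, doc: str) -> float:
    """Weight of `term` in `doc` under the tf-idf scheme sketched above."""
    if df[term] == 0:
        return 0.0
    return tf[doc][term] * math.log(len(corpus) / df[term])

print(cf["web"], df["web"])               # 5 3 -> "web" occurs 5 times across 3 documents
print(round(tfidf("ontology", "d2"), 3))  # 1 * log(3/1) = 1.099
```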
12. Methods used in acquiring the concepts (3)
- Combines information extraction with ontologies and bootstraps [9]
  - The ontology is used to improve the quality of extraction
  - The extracted information is used to improve the ontology
  - The idea is to use indicative terms to find informative terms, and then to use the informative terms to find new indicators (a minimal sketch of this loop follows)
  - The method tries to extract a pattern that makes indicators and informative concepts relevant to each other
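A minimal sketch of the bootstrapping loop described above: indicative terms are used to find co-occurring informative terms, which in turn propose new indicators [9]. The sentence-level co-occurrence window, the stopword list, and the scoring are illustrative assumptions rather than the cited method's exact procedure.

```python
from collections import Counter

STOPWORDS = {"the", "is", "at", "a", "of"}   # tiny illustrative stopword list

def cooccurring_terms(sentences: list[list[str]], seeds: set[str], top_k: int = 3) -> set[str]:
    """Terms that most often share a sentence with one of the seed terms."""
    counts = Counter()
    for tokens in sentences:
        if seeds & set(tokens):
            counts.update(t for t in tokens if t not in seeds and t not in STOPWORDS)
    return {term for term, _ in counts.most_common(top_k)}

def bootstrap(sentences: list[list[str]], seed_indicators: set[str], rounds: int = 2):
    indicators, informative = set(seed_indicators), set()
    for _ in range(rounds):
        informative |= cooccurring_terms(sentences, indicators)   # indicators -> informative terms
        indicators |= cooccurring_terms(sentences, informative)   # informative terms -> new indicators
    return indicators, informative

sentences = [
    "the clinic treats heart disease".split(),
    "heart disease is treated at the hospital".split(),
    "the hospital offers cardiac surgery".split(),
]
indicators, informative = bootstrap(sentences, {"disease"})
print(indicators)   # the indicator set grows beyond the seed term over the iterations
print(informative)
```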
13. Methods used in acquiring the concepts (4)
- Special-purpose concept and taxonomy extraction [7]
  - Methodology: examine the neighborhood of the initial keywords
    - The anterior word of a word classifies it (in English)
    - The posterior word of a word represents the domain (in English)
    - Example: coronary heart disease
  - The method sends a query to a search engine, extracts the anterior and posterior words of the keyword, and decides whether each candidate is an instance or a subclass (a sketch follows this list)
  - Clustering is performed according to the amount of coincidence
  - Synonymy is handled by using constraints and omitting the initial word
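A minimal sketch of the anterior/posterior-word heuristic above [7]. Querying a real search engine is out of scope here, so the result snippets are given as plain strings; the function name and the toy data are illustrative.

```python
from collections import Counter

def neighbour_words(snippets: list[str], keyword: str):
    """Count the words immediately before (anterior) and after (posterior) the keyword."""
    anterior, posterior = Counter(), Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for i, token in enumerate(tokens):
            if token == keyword:
                if i > 0:
                    anterior[tokens[i - 1]] += 1      # e.g. "coronary" before "heart"
                if i + 1 < len(tokens):
                    posterior[tokens[i + 1]] += 1     # e.g. "disease" after "heart"
    return anterior, posterior

snippets = [
    "coronary heart disease is a leading cause of death",
    "congenital heart disease affects newborns",
    "coronary heart disease risk factors",
]
ant, post = neighbour_words(snippets, "heart")
print(ant.most_common())   # [('coronary', 2), ('congenital', 1)] -> candidate subclasses
print(post.most_common())  # [('disease', 3)]                     -> candidate domain term
```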
14. Current Status
- Results
  - IR and computational linguistics can solve the problem
  - Current methods try to derive concepts and form taxonomic relations using the biggest available corpus, the World Wide Web
- Problems to be solved
  - Current efforts mostly use hand-crafted concept hierarchies
  - It is hard to find synonyms for a set of available concepts
  - It is hard to make the discovery of synonyms automatic by using the synonyms found so far
15. Establishment of non-taxonomic relations between concepts
- This is the most important and challenging task in building an ontology.
- Finding concepts and taxonomic relations is simpler than constructing non-taxonomic relations between concepts.
- The approaches are generally a combination of natural language processing and machine learning.
16. Methods proposed to establish relations (1)
- Clustering [13]
  - ASIUM: a system based on an unsupervised clustering method
  - Does not require any annotation of texts by hand
  - Learns knowledge in the form of:
    - Subcategorization frames
      - <to travel> <subject: human> <by: vehicle>
      - "subject" is the syntactic role
      - "by" is the preposition
      - "human" and "vehicle" are the restrictions of selection
    - Ontologies
17. Methods proposed to establish relations (1)
- Pre-processing the text
  - SYLEX provides the training text, which consists of attachments of verbs to noun phrases and clauses.
  - The first step takes the training text as input and generates instantiated subcategorization frames as output:
    - <verb>
    - <subject>
    - <object>
18. Methods proposed to establish relations (1)
- Clustering algorithm
  - Factorizes similar instantiated subcategorization frames
  - Clustering algorithm used in ASIUM:
    - Links represent generality relations
    - Breadth-first
    - Bottom-up clustering
    - Two classes are aggregated at a time
    - Distance is defined as the proportion of common head words in the two clusters, taking their frequencies into account
    - Clusters with a distance below the threshold are aggregated
    - The threshold does not change across levels
    - Only clusters in the same level are taken into account
19. Methods proposed to establish relations (1)
- Notation used in the distance measure:
  - card(C1) and card(C2): the number of different head words in clusters C1 and C2
  - Ncomm: the number of different common head words between C1 and C2
  - Sum_i f(word_i^Cj): the sum of the frequencies of the head words of Cj
  - word_i^Cj: the i-th head word of cluster Cj
  - f(word_i^Cj): its frequency
  - The formula also includes a term that minimizes the influence of word frequencies
20. Methods proposed to establish relations (1)
- This generalization turns instantiated subcategorization frames into subcategorization frames.
- Cooperation of the user is required in the process of building the ontology:
  - The user labels the clusters
  - The user validates the new clusters
    - Rejects those words that restrict the given verbs
    - Partitions new clusters into sub-clusters that would not have been identified before
  - Clusters in each level must be validated before proceeding to the next level
  - The user can partition the clusters and label the sub-concepts if the newly generated classes are useless or meaningless
21. Example
- Subjects: father, neighbor, father, mother, passenger
- Verbs: drive, travel
- Prepositions: using, by, by
- Objects: train, motorbike, car, car, bicycle
- Factorizing yields the head-word sets {car, train, motorbike} and {car, bicycle}
- Clustering aggregates them into {car, train, motorbike, bicycle}, labeled "motorized vehicle"
22. Methods proposed to establish relations (2)
- Generalized association rules [10]
  - A set of transactions is defined
  - Each transaction consists of a set of items, where each item comes from a set of concepts
  - Two factors are considered in estimating how relevant two different concepts Xk and Yk are in an association rule (a sketch follows this list):
    - Support: the percentage of transactions that contain both Xk and Yk
    - Confidence: the percentage of transactions containing Xk in which Yk is also seen
  - Some changes have been applied to the basic association rule algorithm to make it suitable for finding associations at the right level of the taxonomy
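A minimal sketch of the support and confidence measures described above for a candidate rule Xk -> Yk over a set of transactions; the toy transactions are illustrative, and the generalization along the taxonomy from [10] is not shown.

```python
def support(transactions: list[set[str]], x: str, y: str) -> float:
    """Fraction of transactions containing both X and Y."""
    return sum(1 for t in transactions if {x, y} <= t) / len(transactions)

def confidence(transactions: list[set[str]], x: str, y: str) -> float:
    """Fraction of transactions containing X that also contain Y."""
    with_x = [t for t in transactions if x in t]
    return sum(1 for t in with_x if y in t) / len(with_x) if with_x else 0.0

transactions = [
    {"hotel", "beach"},
    {"hotel", "airport"},
    {"hotel", "beach", "restaurant"},
    {"museum", "restaurant"},
]
print(support(transactions, "hotel", "beach"))     # 0.5      (2 of 4 transactions)
print(confidence(transactions, "hotel", "beach"))  # 0.666... (2 of 3 hotel transactions)
```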
23. Methods proposed to establish relations (3)
- Fuzzy Formal Concept Analysis (FFCA) [3]
  - FCA is based on lattice theory and is used for conceptual knowledge discovery (a crisp FCA sketch follows this list)
  - The hierarchical relationship of concepts is organized as a lattice rather than a tree
  - The method uses a citation database to generate concepts
  - The steps in generating an ontology with this method are:
    - FFCA
    - Concept clustering
    - Ontology generation
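To illustrate the lattice-of-concepts idea, here is a minimal sketch of crisp formal concept analysis over a tiny object/attribute context. The fuzzy variant (FFCA) used in [3] additionally attaches membership degrees, which is omitted here; the toy context is an illustrative assumption.

```python
from itertools import chain, combinations

context = {                      # object -> set of attributes
    "paper1": {"ontology", "clustering"},
    "paper2": {"ontology", "lattice"},
    "paper3": {"clustering", "lattice"},
}

def common_attributes(objects: set[str]) -> set[str]:
    """Attributes shared by all given objects (all attributes if the set is empty)."""
    if not objects:
        return set(chain.from_iterable(context.values()))
    return set.intersection(*(context[o] for o in objects))

def objects_with(attrs: set[str]) -> set[str]:
    """Objects that have every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

# A formal concept is a (extent, intent) pair closed under the two maps above.
concepts = set()
for r in range(len(context) + 1):
    for objs in combinations(sorted(context), r):
        intent = common_attributes(set(objs))
        extent = objects_with(intent)            # closure: all objects sharing those attributes
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))        # the printed pairs form the concept lattice
```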
24. Current Status
- Results
  - The proposed methods have reduced the amount of effort required from a human engineer
- Problems to be solved
  - They all consider a single-layer generalization; however, in many cases a multi-layer generalization would result in a better hierarchy
  - The human still plays a key role in designing the ontology, and the quality of the design depends on their work
25. Pruning the generated hierarchy
- The generated ontology contains concepts that are not interesting and should be removed.
- Methods used to remove uninteresting nodes:
  - A rule-based method using the following conditions [6]:
    - Nodes without a domain node are removed
    - Intermediate nodes with the following properties are removed:
      - Nodes without siblings
      - Nodes that are not the root of any concept
      - Conditions that already hold in the ontology
  - IR techniques [12] (a sketch follows this list):
    - Consider term frequencies, compare the frequency of the current term with its frequency in a generic corpus, and remove the term if its relative frequency in the domain is lower than in the generic corpus
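A minimal sketch of the IR-based pruning rule above [12]: a candidate concept is kept only if its relative frequency in the domain corpus is at least as high as its relative frequency in a generic corpus. The frequency tables and totals are illustrative.

```python
def prune(candidates: set[str], domain_counts: dict[str, int], generic_counts: dict[str, int],
          domain_total: int, generic_total: int) -> set[str]:
    """Keep only the candidate terms that are more typical of the domain than of general text."""
    kept = set()
    for term in candidates:
        domain_rate = domain_counts.get(term, 0) / domain_total
        generic_rate = generic_counts.get(term, 0) / generic_total
        if domain_rate >= generic_rate:
            kept.add(term)
    return kept

domain_counts = {"ontology": 40, "lattice": 15, "page": 3}
generic_counts = {"ontology": 2, "lattice": 3, "page": 500}
print(prune({"ontology", "lattice", "page"}, domain_counts, generic_counts, 10_000, 1_000_000))
# {'ontology', 'lattice'} -> "page" is pruned because it is relatively more common in generic text
```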
26. Conclusion
- Progress is being made toward building ontologies whose instances are web pages rather than static texts.
- There is no clear, well-defined way to evaluate automatically built ontologies, so they are compared with hand-crafted ones.
- This fact also hampers the comparison between two semi-automatically built ontologies.
27. References
- [1] Sabou, M., Wroe, C., Goble, C., and Mishne, G. Learning domain ontologies for Web service descriptions: an experiment in bioinformatics. In Proceedings of the 14th International Conference on World Wide Web (WWW '05), Chiba, Japan, May 10-14, 2005. ACM Press, New York, NY, 2005.
- [2] van Hage, W. R., de Rijke, M., and Marx, M. Information Retrieval Support for Ontology Construction and Use. In Proceedings of the 3rd International Semantic Web Conference, 2004, pages 518-533. LNCS, Springer, 2004.
- [3] Quan, T. T., Hui, S. C., Fong, A. C. M., and Cao, T. H. Automatic Generation of Ontology for Scholarly Semantic Web. In Proceedings of the 3rd International Semantic Web Conference, 2004, pages 726-740. LNCS, Springer, 2004.
- [4] Sabou, M., Wroe, C., Goble, C., and Mishne, G. Learning domain ontologies for Web service descriptions: an experiment in bioinformatics. In Proceedings of the 14th International Conference on World Wide Web (WWW '05), Chiba, Japan, May 10-14, 2005. ACM Press, New York, NY, 2005.
- [5] Berendt, B., Hotho, A., and Stumme, G. Towards semantic web mining. In I. Horrocks and J. Hendler (Eds.), The Semantic Web - ISWC 2002, Proceedings of the 1st International Semantic Web Conference, Sardinia, Italy, June 9-12, 2002, pages 264-278. LNCS, Springer, Heidelberg, Germany, 2002.
28. References
- [6] Navigli, R. and Velardi, P. Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics, Volume 30, Issue 2, June 2004.
- [7] Sanchez, D. and Moreno, A. Web Mining Techniques for Automatic Discovery of Medical Knowledge. In Proceedings of the 10th Conference on Artificial Intelligence in Medicine (AIME 2005), Aberdeen, UK, July 23-27, 2005.
- [8] Noy, N. F. and McGuinness, D. L. Ontology Development 101: A Guide to Creating Your First Ontology. Knowledge Systems Laboratory, March 2001.
- [9] Kavalec, M. and Svatek, V. Information Extraction and Ontology Learning Guided by Web Directory. In ECAI Workshop on NLP and ML for Ontology Engineering, Lyon, 2002.
- [10] Maedche, A. and Staab, S. Mining Ontologies from Text. In Proceedings of the 12th European Workshop on Knowledge Acquisition, Modeling and Management, pages 189-202. LNCS, vol. 1937, Springer, London, 2000.
- [11] Schmid, H. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK, 1994.
29. References
- [12] Maedche, A. and Staab, S. Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2), March 2001.
- [13] Faure, D. and Nédellec, C. ASIUM: Learning subcategorization frames and restrictions of selection. In the 10th European Conference on Machine Learning (ECML 98), Workshop on Text Mining, Chemnitz, Germany, April 1998.
- [14] Hearst, M. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France, 1992.