Title: Semantics
1. Semantics: Data Mining and Document Processing
- Nima Kaviani
- School of Interactive Arts and Technology
- Simon Fraser University, Surrey
2. Towards Semantic Web Mining [5]
- The idea is to combine two fast-developing research areas: the Semantic Web and Web Mining.
- The Semantic Web can be used to improve the results of web mining by exploiting the new semantic structures in the web.
- Web mining is useful for enhancing concepts and instances: it can learn the definition of structures for knowledge organization and populate such knowledge organizations.
3. Web Mining
- Definition: the application of data mining techniques to the content, structure, and usage of web resources.
- Web Content Mining: a form of text mining that extracts data from the content of web pages.
- Web Structure Mining: extracts information residing in the structure of hypertext (the idea behind links, also used for page ranking).
- Web Usage Mining: the web resource being mined is the record of requests made by users, used to capture user behavior.
4. Semantic Web
- Definition: adding semantic annotation to web documents so that they can be easily understood by humans and read by machines for further inference.
- Ontology Learning: semi-automatic extraction of semantics from the web to create an ontology.
- Mapping and Merging Ontologies: merging different ontologies to build a new domain-specific ontology (described by Davis).
- Instance Learning: automatic or semi-automatic methods to extract information from web-related documents, either to help in annotating new documents or to extract additional information from existing unstructured or partially structured documents.
5. Creating an Ontology
- An ontology is a conceptualization of a domain into a human-understandable but machine-readable format: a quadruple of entities, attributes, relationships, and axioms [3]. A minimal data-structure sketch follows this list.
- Steps in creating an ontology for the data [8]:
- determining the scope of the ontology
- reusing existing ontologies
- enumerating all the concepts needed
- defining the taxonomy
- defining the properties
- defining facets of the concepts
- defining instances
- These steps are normally performed by an ontology engineer.
- They can be performed semi-automatically.
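To make the quadruple above concrete, here is a minimal sketch of an ontology as a data structure in Python. The class name, field names, and the add_isa helper are illustrative assumptions, not part of any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    # The "quadruple" from the slide: entities, attributes, relationships, axioms.
    entities: set[str] = field(default_factory=set)                    # concept names
    attributes: dict[str, set[str]] = field(default_factory=dict)      # concept -> attribute names
    relations: set[tuple[str, str, str]] = field(default_factory=set)  # (subject, predicate, object)
    axioms: list[str] = field(default_factory=list)                    # free-form constraints, for illustration

    def add_isa(self, child: str, parent: str) -> None:
        """Record a taxonomic (is-a) relation between two concepts."""
        self.entities.update({child, parent})
        self.relations.add((child, "is-a", parent))

onto = Ontology()
onto.add_isa("car", "motorized vehicle")
onto.attributes["car"] = {"number_of_wheels"}
onto.axioms.append("every car has exactly four wheels")
print(onto.relations)  # {('car', 'is-a', 'motorized vehicle')}
```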
6. Ontology Learning
- Why do we try to make ontology learning automatic (or semi-automatic)?
- The source data is usually stored in unstructured, semi-structured (HTML, XML), or structured (database) formats and has to be processed before it can be used to create the ontology [3].
- Laborious and cumbersome task
- Time consuming
- Dynamic nature of the available domains
- Lack of tools and guidelines [1]
7. Semi-Automatic Ontology Learning
- It aims to integrate a multitude of disciplines in order to facilitate the construction of ontologies [12]. Because of the tacit information involved, human intervention is always required [5].
- Steps in building an ontology automatically:
- Acquisition of concepts
- Establishment of concept taxonomies
- Discovery of non-taxonomic conceptual relations
- Pruning the generated ontology
8. Acquisition of concepts and establishment of taxonomic relations
- Using IR and NLP techniques, concepts can be extracted quite efficiently.
- Techniques are normally a combination of the methods below, usually with a stronger emphasis on one of them:
- Computational linguistics
- Information retrieval
9. Term Definitions
- Recall
  - The fraction of the known relevant documents that were effectively retrieved.
- Precision
  - The fraction of the retrieved documents that are known to be relevant.
- TF-IDF
  - The tf-idf weight (term frequency - inverse document frequency) is a weighting scheme often used in information retrieval.
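As a quick illustration of the recall and precision definitions above, here is a minimal sketch computed over sets of document identifiers; the function names and the toy document sets are illustrative.

```python
def recall(relevant: set[str], retrieved: set[str]) -> float:
    """Fraction of the known relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

def precision(relevant: set[str], retrieved: set[str]) -> float:
    """Fraction of the retrieved documents that are known to be relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
print(recall(relevant, retrieved))     # 0.5      (2 of 4 relevant documents were found)
print(precision(relevant, retrieved))  # 0.666... (2 of 3 retrieved documents are relevant)
```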
10. Methods used in acquiring the concepts (1)
- Computational linguistic approach
  - Pre-processing the text to extract dependencies and single-word nouns
    - POS tagger (part-of-speech) [11]; dependency parser [2, 4]
    - Minipar (a state-of-the-art dependency parser) [1]
  - Extracting multi-word noun phrases [2]
    - Shallow-parse the text
    - Filter out word phrases with interesting POS-tag patterns
    - Decide for each phrase whether it is a noun phrase
  - Extracting taxonomic relations [14]
    - Uses regular expressions to find is-a relations (a regex sketch follows this list)
    - Defines lexico-syntactic patterns such as
      - "NP, NP, ... or other NP"
      - "Bruises, wounds, broken bones, or other injuries"
  - The overall process is a combination of the tools below [12]:
    - Tokenizer: regular expressions to find nouns
    - Lexicon: a large repository of stems
    - Lexical analyzer: merges the results of the two components above and extracts new concepts
    - Chunk parser: works on phrases to generate syntactic dependency relations; uses the POS tagger
    - Heuristics: include correlations besides the linguistics-based dependency relations
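The following is a minimal sketch of the lexico-syntactic "NP, NP, ... or other NP" pattern mentioned above for finding is-a relations [14]. The regular expression and the simple comma splitting are illustrative assumptions; a real system would apply such patterns to chunker output rather than raw strings.

```python
import re

# Matches "X, Y, Z, or other W" and treats X, Y, Z as hyponyms of W.
PATTERN = re.compile(
    r"(?P<hyponyms>(?:\w+(?:\s\w+)*,\s)+)(?:and\s|or\s)other\s(?P<hypernym>\w+(?:\s\w+)*)"
)

def extract_isa(sentence: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs matched by the 'or other' pattern."""
    relations = []
    for m in PATTERN.finditer(sentence):
        hypernym = m.group("hypernym")
        for hyponym in m.group("hyponyms").rstrip(", ").split(","):
            relations.append((hyponym.strip(), hypernym))
    return relations

print(extract_isa("bruises, wounds, broken bones, or other injuries"))
# [('bruises', 'injuries'), ('wounds', 'injuries'), ('broken bones', 'injuries')]
```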
11. Methods used in acquiring the concepts (2)
- An information retrieval method using term weightings [12]
  - Counts relevant terms and extracts the more frequent ones as concepts
  - tf(l, d): the frequency of appearance of term l in document d
  - df(l): the number of documents in the corpus D in which term l occurs
  - cf(l): the total number of occurrences of term l in the corpus D
  - These are combined into a tf-idf style weight, tfidf(l, d) = tf(l, d) * log(|D| / df(l)); a small worked example follows this list
- Methods to find the taxonomy
  - Clustering (starts from scratch and uses distributional data about words)
  - Classification (uses an available hierarchy and refines it)
  - Lexico-syntactic patterns (regular expressions)
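A small worked example of the term weights defined above, assuming a toy three-document corpus and the standard tf-idf combination tf(l, d) * log(|D| / df(l)):

```python
import math
from collections import Counter

corpus = {
    "d1": "semantic web mining combines semantic web and web mining".split(),
    "d2": "ontology learning extracts concepts from web documents".split(),
    "d3": "web usage mining records user requests".split(),
}

tf = {d: Counter(tokens) for d, tokens in corpus.items()}                  # tf(l, d)
df = Counter(term for tokens in corpus.values() for term in set(tokens))  # df(l)
cf = Counter(term for tokens in corpus.values() for term in tokens)       # cf(l)

def tfidf(term: str, doc: str) -> float:
    """Weight of `term` in `doc` under the tf-idf scheme sketched above."""
    if df[term] == 0:
        return 0.0
    return tf[doc][term] * math.log(len(corpus) / df[term])

print(cf["web"], df["web"])               # 5 3 -> "web" occurs 5 times across 3 documents
print(round(tfidf("ontology", "d2"), 3))  # 1 * log(3/1) = 1.099
```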
12. Methods used in acquiring the concepts (3)
- Combines information extraction with ontologies and bootstraps [9]
  - The ontology is used to improve the quality of extraction
  - The extracted information is used to improve the ontology
  - The idea is to use indicative terms to find informative terms, and then to use the informative terms to find new indicators (a minimal sketch of this loop follows)
  - The method tries to extract a pattern that makes indicators and informative concepts relevant to each other
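A minimal sketch of the bootstrapping loop described above: indicative terms are used to find co-occurring informative terms, which in turn propose new indicators [9]. The sentence-level co-occurrence window, the stopword list, and the scoring are illustrative assumptions rather than the cited method's exact procedure.

```python
from collections import Counter

STOPWORDS = {"the", "is", "at", "a", "of"}   # tiny illustrative stopword list

def cooccurring_terms(sentences: list[list[str]], seeds: set[str], top_k: int = 3) -> set[str]:
    """Terms that most often share a sentence with one of the seed terms."""
    counts = Counter()
    for tokens in sentences:
        if seeds & set(tokens):
            counts.update(t for t in tokens if t not in seeds and t not in STOPWORDS)
    return {term for term, _ in counts.most_common(top_k)}

def bootstrap(sentences: list[list[str]], seed_indicators: set[str], rounds: int = 2):
    indicators, informative = set(seed_indicators), set()
    for _ in range(rounds):
        informative |= cooccurring_terms(sentences, indicators)   # indicators -> informative terms
        indicators |= cooccurring_terms(sentences, informative)   # informative terms -> new indicators
    return indicators, informative

sentences = [
    "the clinic treats heart disease".split(),
    "heart disease is treated at the hospital".split(),
    "the hospital offers cardiac surgery".split(),
]
indicators, informative = bootstrap(sentences, {"disease"})
print(indicators)   # the indicator set grows beyond the seed term over the iterations
print(informative)
```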
13. Methods used in acquiring the concepts (4)
- Special-purpose concept and taxonomy extraction [7]
  - Methodology: examine the neighborhood of the initial keywords
    - The anterior word of a word classifies it (in English)
    - The posterior word of a word represents the domain (in English)
    - Example: coronary heart disease
  - The method sends a query to a search engine, extracts the anterior and posterior words of the keyword, and decides whether each candidate is an instance or a subclass (a sketch follows this list)
  - Clustering is performed according to the amount of coincidence
  - Synonymy is handled by using constraints and omitting the initial word
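A minimal sketch of the anterior/posterior-word heuristic above [7]. Querying a real search engine is out of scope here, so the result snippets are given as plain strings; the function name and the toy data are illustrative.

```python
from collections import Counter

def neighbour_words(snippets: list[str], keyword: str):
    """Count the words immediately before (anterior) and after (posterior) the keyword."""
    anterior, posterior = Counter(), Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for i, token in enumerate(tokens):
            if token == keyword:
                if i > 0:
                    anterior[tokens[i - 1]] += 1      # e.g. "coronary" before "heart"
                if i + 1 < len(tokens):
                    posterior[tokens[i + 1]] += 1     # e.g. "disease" after "heart"
    return anterior, posterior

snippets = [
    "coronary heart disease is a leading cause of death",
    "congenital heart disease affects newborns",
    "coronary heart disease risk factors",
]
ant, post = neighbour_words(snippets, "heart")
print(ant.most_common())   # [('coronary', 2), ('congenital', 1)] -> candidate subclasses
print(post.most_common())  # [('disease', 3)]                     -> candidate domain term
```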
14. Current Status
- Results
  - IR and computational linguistics can solve the problem
  - Current methods try to derive concepts and form taxonomic relations using the biggest available corpus, the World Wide Web
- Problems to be solved
  - Current efforts mostly use hand-crafted concept hierarchies
  - It is hard to find synonyms for a set of available concepts
  - It is hard to make the discovery of synonyms automatic by using the synonyms found so far
15. Establishment of non-taxonomic relations between concepts
- This is the most important and challenging task in building an ontology.
- Finding concepts and taxonomic relations is simpler than constructing non-taxonomic relations between concepts.
- The approaches are generally a combination of natural language processing and machine learning.
16. Methods proposed to establish relations (1)
- Clustering [13]
  - ASIUM: a system based on an unsupervised clustering method
  - Does not require any annotation of texts by hand
  - Learns knowledge in the form of:
    - Subcategorization frames
      - <to travel> <subject: human> <by: vehicle>
      - "subject" is the syntactic role
      - "by" is the preposition
      - "human" and "vehicle" are the restrictions of selection
    - Ontologies
17. Methods proposed to establish relations (1)
- Pre-processing the text
  - SYLEX provides the training text, which consists of attachments of verbs to noun phrases and clauses.
  - The first step takes the training text as input and generates instantiated subcategorization frames as output:
    - <verb>
    - <subject>
    - <object>
18. Methods proposed to establish relations (1)
- Clustering algorithm
  - Factorizes similar instantiated subcategorization frames
  - Clustering algorithm used in ASIUM:
    - Links represent generality relations
    - Breadth-first
    - Bottom-up clustering
    - Two classes are aggregated at a time
    - Distance is defined as the proportion of common head words in the two clusters, taking their frequencies into account
    - Clusters with a distance below the threshold are aggregated
    - The threshold does not change across levels
    - Only clusters in the same level are taken into account
19. Methods proposed to establish relations (1)
- Notation used in the distance measure:
  - card(C1) and card(C2): the number of different head words in clusters C1 and C2
  - Ncomm: the number of different common head words between C1 and C2
  - Sum_i f(word_i^Cj): the sum of the frequencies of the head words of Cj
  - word_i^Cj: the i-th head word of cluster Cj
  - f(word_i^Cj): its frequency
  - The formula also includes a term that minimizes the influence of word frequencies
20. Methods proposed to establish relations (1)
- This generalization turns instantiated subcategorization frames into subcategorization frames.
- Cooperation of the user is required in the process of building the ontology:
  - The user labels the clusters
  - The user validates the new clusters
    - Rejects those words that restrict the given verbs
    - Partitions new clusters into sub-clusters that would not have been identified before
  - Clusters in each level must be validated before proceeding to the next level
  - The user can partition the clusters and label the sub-concepts if the newly generated classes are useless or meaningless
21. Example
- Subjects: father, neighbor, father, mother, passenger
- Verbs: drive, travel
- Prepositions: using, by, by
- Objects: train, motorbike, car, car, bicycle
- Factorizing yields the head-word sets {car, train, motorbike} and {car, bicycle}
- Clustering aggregates them into {car, train, motorbike, bicycle}, labeled "motorized vehicle"
22. Methods proposed to establish relations (2)
- Generalized association rules [10]
  - A set of transactions is defined
  - Each transaction consists of a set of items, where each item comes from a set of concepts
  - Two factors are considered in estimating how relevant two different concepts Xk and Yk are in an association rule (a sketch follows this list):
    - Support: the percentage of transactions that contain both Xk and Yk
    - Confidence: the percentage of transactions containing Xk in which Yk is also seen
  - Some changes have been applied to the basic association rule algorithm to make it suitable for finding associations at the right level of the taxonomy
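A minimal sketch of the support and confidence measures described above for a candidate rule Xk -> Yk over a set of transactions; the toy transactions are illustrative, and the generalization along the taxonomy from [10] is not shown.

```python
def support(transactions: list[set[str]], x: str, y: str) -> float:
    """Fraction of transactions containing both X and Y."""
    return sum(1 for t in transactions if {x, y} <= t) / len(transactions)

def confidence(transactions: list[set[str]], x: str, y: str) -> float:
    """Fraction of transactions containing X that also contain Y."""
    with_x = [t for t in transactions if x in t]
    return sum(1 for t in with_x if y in t) / len(with_x) if with_x else 0.0

transactions = [
    {"hotel", "beach"},
    {"hotel", "airport"},
    {"hotel", "beach", "restaurant"},
    {"museum", "restaurant"},
]
print(support(transactions, "hotel", "beach"))     # 0.5      (2 of 4 transactions)
print(confidence(transactions, "hotel", "beach"))  # 0.666... (2 of 3 hotel transactions)
```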
23. Methods proposed to establish relations (3)
- Fuzzy Formal Concept Analysis (FFCA) [3]
  - FCA is based on lattice theory and is used for conceptual knowledge discovery (a crisp FCA sketch follows this list)
  - The hierarchical relationship of concepts is organized as a lattice rather than a tree
  - The method uses a citation database to generate concepts
  - The steps in generating an ontology with this method are:
    - FFCA
    - Concept clustering
    - Ontology generation
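To illustrate the lattice-of-concepts idea, here is a minimal sketch of crisp formal concept analysis over a tiny object/attribute context. The fuzzy variant (FFCA) used in [3] additionally attaches membership degrees, which is omitted here; the toy context is an illustrative assumption.

```python
from itertools import chain, combinations

context = {                      # object -> set of attributes
    "paper1": {"ontology", "clustering"},
    "paper2": {"ontology", "lattice"},
    "paper3": {"clustering", "lattice"},
}

def common_attributes(objects: set[str]) -> set[str]:
    """Attributes shared by all given objects (all attributes if the set is empty)."""
    if not objects:
        return set(chain.from_iterable(context.values()))
    return set.intersection(*(context[o] for o in objects))

def objects_with(attrs: set[str]) -> set[str]:
    """Objects that have every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

# A formal concept is a (extent, intent) pair closed under the two maps above.
concepts = set()
for r in range(len(context) + 1):
    for objs in combinations(sorted(context), r):
        intent = common_attributes(set(objs))
        extent = objects_with(intent)            # closure: all objects sharing those attributes
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))        # the printed pairs form the concept lattice
```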
24. Current Status
- Results
  - The proposed methods have reduced the amount of effort required from a human engineer
- Problems to be solved
  - They all consider a single-layer generalization; however, in many cases a multi-layer generalization would result in a better hierarchy
  - The human still plays a key role in designing the ontology, and the quality of the design depends on their work
25. Pruning the generated hierarchy
- The generated ontology contains concepts that are not interesting and should be removed.
- Methods used to remove uninteresting nodes:
  - A rule-based method using the following conditions [6]:
    - Nodes without a domain node are removed
    - Intermediate nodes with the following properties are removed:
      - Nodes without siblings
      - Nodes that are not the root of any concept
      - Conditions that already hold in the ontology
  - IR techniques [12] (a sketch follows this list):
    - Consider term frequencies, compare the frequency of the current term with its frequency in a generic corpus, and remove the term if its relative frequency in the domain is lower than in the generic corpus
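A minimal sketch of the IR-based pruning rule above [12]: a candidate concept is kept only if its relative frequency in the domain corpus is at least as high as its relative frequency in a generic corpus. The frequency tables and totals are illustrative.

```python
def prune(candidates: set[str], domain_counts: dict[str, int], generic_counts: dict[str, int],
          domain_total: int, generic_total: int) -> set[str]:
    """Keep only the candidate terms that are more typical of the domain than of general text."""
    kept = set()
    for term in candidates:
        domain_rate = domain_counts.get(term, 0) / domain_total
        generic_rate = generic_counts.get(term, 0) / generic_total
        if domain_rate >= generic_rate:
            kept.add(term)
    return kept

domain_counts = {"ontology": 40, "lattice": 15, "page": 3}
generic_counts = {"ontology": 2, "lattice": 3, "page": 500}
print(prune({"ontology", "lattice", "page"}, domain_counts, generic_counts, 10_000, 1_000_000))
# {'ontology', 'lattice'} -> "page" is pruned because it is relatively more common in generic text
```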
26. Conclusion
- Progress is being made toward building ontologies whose instances are web pages rather than static texts.
- There is no clear, well-defined way to evaluate automatically built ontologies, so they are compared with hand-crafted ones.
- This fact also hampers the comparison between two semi-automatically built ontologies.
27. References
- [1] Sabou, M., Wroe, C., Goble, C., and Mishne, G. Learning domain ontologies for Web service descriptions: an experiment in bioinformatics. In Proceedings of the 14th International Conference on World Wide Web (WWW '05), Chiba, Japan, May 10-14, 2005. ACM Press, New York, NY, 2005.
- [2] van Hage, W. R., de Rijke, M., and Marx, M. Information Retrieval Support for Ontology Construction and Use. In Proceedings of the 3rd International Semantic Web Conference, 2004, pages 518-533. LNCS, Springer, 2004.
- [3] Quan, T. T., Hui, S. C., Fong, A. C. M., and Cao, T. H. Automatic Generation of Ontology for Scholarly Semantic Web. In Proceedings of the 3rd International Semantic Web Conference, 2004, pages 726-740. LNCS, Springer, 2004.
- [4] Sabou, M., Wroe, C., Goble, C., and Mishne, G. Learning domain ontologies for Web service descriptions: an experiment in bioinformatics. In Proceedings of the 14th International Conference on World Wide Web (WWW '05), Chiba, Japan, May 10-14, 2005. ACM Press, New York, NY, 2005.
- [5] Berendt, B., Hotho, A., and Stumme, G. Towards semantic web mining. In I. Horrocks and J. Hendler (Eds.), The Semantic Web - ISWC 2002, Proceedings of the 1st International Semantic Web Conference, Sardinia, Italy, June 9-12, 2002, pages 264-278. LNCS, Springer, Heidelberg, Germany, 2002.
28. References
- [6] Navigli, R. and Velardi, P. Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics, Volume 30, Issue 2, June 2004.
- [7] Sanchez, D. and Moreno, A. Web Mining Techniques for Automatic Discovery of Medical Knowledge. In Proceedings of the 10th Conference on Artificial Intelligence in Medicine (AIME 2005), Aberdeen, UK, July 23-27, 2005.
- [8] Noy, N. F. and McGuinness, D. L. Ontology Development 101: A Guide to Creating Your First Ontology. Knowledge Systems Laboratory, March 2001.
- [9] Kavalec, M. and Svatek, V. Information Extraction and Ontology Learning Guided by Web Directory. In ECAI Workshop on NLP and ML for Ontology Engineering, Lyon, 2002.
- [10] Maedche, A. and Staab, S. Mining Ontologies from Text. In Proceedings of the 12th European Workshop on Knowledge Acquisition, Modeling and Management, pages 189-202. LNCS, vol. 1937, Springer, London, 2000.
- [11] Schmid, H. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK, 1994.
29. References
- [12] Maedche, A. and Staab, S. Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2), March 2001.
- [13] Faure, D. and Nédellec, C. ASIUM: Learning subcategorization frames and restrictions of selection. In the 10th European Conference on Machine Learning (ECML 98), Workshop on Text Mining, Chemnitz, Germany, April 1998.
- [14] Hearst, M. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France, 1992.