1
Semantics, Data Mining, Document Processing
  • Nima Kaviani
  • School of Interactive Arts and Technology
  • Simon Fraser University, Surrey

2
Towards Semantic Web Mining [5]
  • The idea is to combine two fast-developing research areas: the Semantic Web and Web Mining.
  • The Semantic Web can be used to improve the results of web mining by exploiting the new semantic structures in the web.
  • Web mining is useful for enhancing the concepts and instances: it can learn the definitions of structures for knowledge organization and populate those structures.

3
Web Mining
  • Definition: the application of data mining techniques to the content, structure, and usage of web resources.
  • Web Content Mining: a form of text mining that extracts data from the content of web pages.
  • Web Structure Mining: extracts information residing in the structure of hypertext (the ideas behind links, also used for page ranking).
  • Web Usage Mining: the web resource being mined is the record of the requests made by users, used to capture user behavior.

4
Semantic Web
  • Definition: adding semantic annotations to web documents so that they can be easily understood by humans and read by machines for further inference.
  • Ontology Learning: semi-automatic extraction of semantics from the web to create an ontology.
  • Mapping and Merging Ontologies: merging different ontologies to build a new domain-specific ontology (described by Davis).
  • Instance Learning: automatic or semi-automatic methods for extracting information from web-related documents, either to help annotate new documents or to extract additional information from existing unstructured or partially structured documents.

5
Creating an Ontology
  • An ontology is a conceptualization of a domain into a human-understandable but machine-readable format: a quadruple of entities, attributes, relationships, and axioms [3] (see the sketch below).
  • Steps in creating an ontology for the data [8]:
  • determining the scope of the ontology
  • reusing existing ontologies
  • enumerating all the concepts needed
  • defining the taxonomy
  • defining the properties
  • defining facets of the concepts
  • defining instances

These steps are normally performed by an ontology engineer; several of them can be performed semi-automatically.
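The quadruple view of an ontology can be made concrete with a small, hypothetical example; the domain, names, and axiom below are illustrative only and are not taken from the presentation.

    # A toy ontology as a quadruple of entities, attributes, relationships,
    # and axioms, roughly following the definition in [3].
    toy_ontology = {
        "entities": ["Disease", "HeartDisease", "Patient"],
        "attributes": {"Disease": ["name"], "Patient": ["age"]},
        "relationships": [
            ("HeartDisease", "is_a", "Disease"),      # taxonomic relation
            ("Patient", "suffers_from", "Disease"),   # non-taxonomic relation
        ],
        "axioms": [
            # every HeartDisease instance is also a Disease instance
            "forall x: HeartDisease(x) -> Disease(x)",
        ],
    }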
6
Ontology Learning
  • Why do we try to make ontology learning automatic (or semi-automatic)?
  • The source data is usually stored in unstructured, semi-structured (HTML, XML), or structured (database) formats and has to be processed before it can be used to create the ontology [3].
  • Manual construction is a laborious and cumbersome task
  • Time consuming
  • Dynamic nature of the available domains
  • Lack of tools and guidelines [1]

7
Semi-Automatic Ontology Learning
  • It aims to integrate a multitude of disciplines in order to facilitate the construction of ontologies [12]. Because of the tacit information involved, human intervention is always required [5].
  • Steps in building an ontology (semi-)automatically:
  • Acquisition of concepts
  • Establishment of concept taxonomies
  • Discovery of non-taxonomic conceptual relations
  • Pruning the generated ontology

8
Acquisition of concepts and establishing
taxonomic relations
  • Using IR and NLP techniques, concepts can be extracted quite efficiently.
  • Techniques are normally a combination of the two approaches below, usually emphasizing one of them more strongly:
  • Computational linguistics
  • Information retrieval

9
Term Definitions
  • Recall
  • The fraction of the known relevant documents that were effectively retrieved.
  • Precision
  • The fraction of the retrieved documents that are known to be relevant.
  • TF-IDF
  • The tf-idf weight (term frequency - inverse document frequency) is a weighting scheme often used in Information Retrieval.
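For reference, the standard formulations of these measures (one common tf-idf variant among several) can be written as follows, using the tf, df, and D notation defined on slide 11:

    recall    = |relevant ∩ retrieved| / |relevant|
    precision = |relevant ∩ retrieved| / |retrieved|
    tf-idf(l, d) = tf(l, d) · log( |D| / df(l) )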

10
Methods used in acquiring the concepts-1
  • Computational linguistics approach
  • Pre-processing the text to extract dependencies and single-word nouns
  • POS-tagger [11] (part-of-speech dependency parser) [2, 4]
  • Minipar (a state-of-the-art dependency parser) [1]
  • Extracting multi-word noun phrases [2]
  • Shallow-parse the text
  • Filter out word phrases with interesting POS-tag patterns
  • Decide for each phrase whether it is a noun phrase
  • Extracting taxonomic relations [14]
  • Uses regular expressions to find ISA relations (see the sketch after this list)
  • Defining regular-expression patterns such as
  • NP, NP, ... , or other NP
  • "Bruises, wounds, broken bones, or other injuries"
  • The overall process is a combination of the tools below [12]:
  • Tokenizer: regular expressions to find nouns
  • Lexicon: a large repository of stems
  • Lexical analyzer: combines the results of the two components above and extracts new concepts
  • Chunk parser: works on phrases to generate syntactic dependency relations; uses the POS-tagger
  • Heuristics: include correlations in addition to linguistics-based dependency relations
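A minimal sketch of this kind of lexico-syntactic (Hearst-style) pattern matching; the regular expression and function name are illustrative and are not taken from the cited systems.

    import re

    # Hearst-style pattern: "NP, NP, ... , or other NP" signals that the listed
    # noun phrases are hyponyms (ISA children) of the final noun phrase.
    PATTERN = re.compile(
        r"(?P<members>\w+(?:\s\w+)*(?:\s*,\s*\w+(?:\s\w+)*)+)\s*,?\s+or other\s+(?P<hypernym>\w+(?:\s\w+)*)",
        re.IGNORECASE,
    )

    def extract_isa(sentence):
        """Return (hyponym, hypernym) pairs found by the pattern."""
        pairs = []
        for m in PATTERN.finditer(sentence):
            hypernym = m.group("hypernym")
            for member in re.split(r"\s*,\s*", m.group("members")):
                pairs.append((member.strip(), hypernym))
        return pairs

    # Example from the slide:
    # extract_isa("Bruises, wounds, broken bones, or other injuries")
    # -> [('Bruises', 'injuries'), ('wounds', 'injuries'), ('broken bones', 'injuries')]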

11
Methods used in acquiring the concepts-2
  • An information retrieval method using term weightings [12] (see the counting sketch after this list)
  • Counting relevant terms and extracting the more frequent ones as concepts
  • tf(l, d): the frequency of appearance of term l in document d
  • df(l): the number of documents in the corpus D in which term l occurs
  • cf(l): the total number of occurrences of term l in the corpus D
  • Methods to find the taxonomy
  • Clustering (starts from scratch and uses distributional data about words)
  • Classification (uses an available hierarchy and refines it)
  • Lexico-syntactic patterns (regular expressions)
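A minimal sketch of the frequency counts just defined, computed over a toy corpus; the corpus and function names are illustrative only.

    # Toy corpus D: each document is a list of already-tokenized terms.
    D = [
        ["heart", "disease", "treatment", "heart"],
        ["coronary", "heart", "disease"],
        ["ontology", "learning"],
    ]

    def tf(l, d):
        """tf(l, d): frequency of term l in document d."""
        return d.count(l)

    def df(l, corpus):
        """df(l): number of documents in the corpus that contain term l."""
        return sum(1 for d in corpus if l in d)

    def cf(l, corpus):
        """cf(l): total number of occurrences of term l in the corpus."""
        return sum(d.count(l) for d in corpus)

    # Frequent terms are candidate domain concepts:
    # tf("heart", D[0]) == 2, df("heart", D) == 2, cf("heart", D) == 3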

12
Methods used in acquiring the concepts-3
  • Combines information extraction with ontologies and bootstraps between them [9]
  • The ontology is used to improve the quality of extraction
  • The extracted information is used to improve the ontology
  • The idea is to use indicative terms to find informative terms, and then to use the informative terms to find new indicators (see the sketch below)
  • It tries to extract a pattern that relates indicators and informative concepts
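A minimal, hypothetical sketch of the bootstrapping loop described above; the co-occurrence heuristics and data structures are placeholders, not the actual method of [9].

    def bootstrap(documents, seed_indicators, rounds=3):
        """documents: list of term lists. Alternately grow the sets of
        informative terms (found near indicators) and indicators (found
        near informative terms)."""
        indicators = set(seed_indicators)
        informative = set()
        for _ in range(rounds):
            # 1. Indicators -> informative terms: any term that co-occurs
            #    with an indicator in some document is a candidate.
            for doc in documents:
                if indicators & set(doc):
                    informative |= set(doc) - indicators
            # 2. Informative terms -> new indicators: terms appearing in
            #    documents with at least two informative terms.
            for doc in documents:
                if len(informative & set(doc)) >= 2:
                    indicators |= set(doc) - informative
        return indicators, informative

    # docs = [["special", "offer", "laptop"], ["offer", "discount", "camera"]]
    # bootstrap(docs, seed_indicators={"offer"})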

13
Methods used in acquiring the concepts-4
  • Specific-purpose concept and taxonomy extraction [7]
  • Methodology: examine the neighborhood of initial keywords
  • The anterior word of a word classifies it (in English)
  • The posterior word of a word represents the domain (in English)
  • Example: "coronary heart disease"
  • The query is sent to a search engine; the anterior and posterior words of the keyword are extracted, and each candidate word is classified as an instance or a subclass (see the sketch after this list)
  • Clustering is performed according to the amount of co-occurrence
  • Synonymy is handled by using constraints and omitting the initial word
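A minimal sketch of the anterior/posterior-word idea applied to text snippets such as those returned by a search engine; the snippet, keyword, and function names are illustrative only.

    import re

    def neighbours(snippets, keyword):
        """Collect the word immediately before (anterior) and immediately
        after (posterior) each occurrence of the keyword."""
        anterior, posterior = [], []
        pattern = re.compile(r"(\w+)\s+" + re.escape(keyword) + r"\s+(\w+)", re.IGNORECASE)
        for text in snippets:
            for before, after in pattern.findall(text):
                anterior.append(before.lower())   # anterior word classifies the keyword
                posterior.append(after.lower())   # posterior word indicates the domain
        return anterior, posterior

    # snippets = ["Coronary heart disease is the most common type of heart disease."]
    # neighbours(snippets, "heart")
    # -> (['coronary', 'of'], ['disease', 'disease'])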

14
Current Status
  • Results
  • IR and computational linguistics can solve the problem
  • Current methods try to derive concepts and form taxonomic relations using the biggest available corpus, the World Wide Web
  • Problems to be solved
  • Current efforts mostly use hand-crafted concept hierarchies
  • It is hard to find synonyms for a set of available concepts
  • It is hard to automate the discovery of new synonyms using the synonyms found so far

15
Establishment of non-taxonomic relations between
concepts
  • The most important and challenging task in building an ontology.
  • Finding concepts and taxonomic relations is simple compared with constructing non-taxonomic relations between concepts.
  • The proposed approaches are generally a combination of Natural Language Processing and Machine Learning.

16
Methods proposed to establish relations-1
  • Clustering [13]
  • ASIUM: a tool based on an unsupervised clustering method
  • Does not require any annotation of texts by hand
  • Learns knowledge in the form of
  • Subcategorization frames
  • <to travel> <subject: human> <by: vehicle>
  • subject is the syntactic role
  • by is the preposition
  • human and vehicle are the restrictions of selection
  • Ontologies

17
Methods proposed to establish relations-1
  • Pre-processing the text
  • SYLEX provides the training text, which consists of attachments of verbs to noun phrases and clauses.
  • The first step takes the training text as input and generates instantiated subcategorization frames as output:
  • <verb>
  • <subject>
  • <object>

18
Methods proposed to establish relations-1
  • Clustering algorithm
  • Factorizes similar instantiated subcategorization frames
  • The clustering algorithm used in ASIUM:
  • Links represent generality relations
  • Breadth-first
  • Bottom-up clustering
  • Two classes are aggregated at a time
  • Distance is defined as the portion of common head words in the two clusters, taking their frequencies into account
  • Clusters with a distance less than the threshold are aggregated
  • The threshold does not change across levels
  • Only the available clusters at the same level are taken into account

19
Methods proposed to establish relations-1
  • card(C1), card(C2): the number of different head words in clusters C1 and C2
  • Ncomm: the number of different common head words between C1 and C2
  • word_i^Cj: the i-th head word of cluster Cj
  • f(word_i^Cj): its frequency
  • Sum_i f(word_i^Cj): the sum of the frequencies of the head words of Cj
  • The distance formula combines these quantities so that the influence of the word frequencies is minimized (see the sketch below)
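A minimal sketch of a distance of this flavor, reading it as the share of frequency mass carried by common head words; it illustrates the quantities defined above but is not the exact ASIUM formula of [13].

    def cluster_distance(c1, c2):
        """c1, c2: dicts mapping head words to their frequencies f(word).
        Returns a value in [0, 1]: 0 if all frequency mass is shared,
        1 if the clusters have no head word in common."""
        common = set(c1) & set(c2)                    # Ncomm == len(common)
        total = sum(c1.values()) + sum(c2.values())   # Sum f(word_i^C1) + Sum f(word_i^C2)
        shared = sum(c1[w] for w in common) + sum(c2[w] for w in common)
        return 1.0 - shared / total

    # Object clusters from the example slide (frequencies are made up):
    # drive_objects  = {"car": 2, "train": 1, "motorbike": 1}
    # travel_objects = {"car": 1, "bicycle": 1}
    # cluster_distance(drive_objects, travel_objects) == 1.0 - 3/6 == 0.5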

20
Methods proposed to establish relations-1
  • This generalization turns instantiated subcategorization frames into subcategorization frames
  • Cooperation of the user is required in the process of building the ontology
  • The user labels the clusters
  • The user validates the new clusters
  • Rejects those words that restrict the given verbs
  • Partitions new clusters into sub-clusters that would not have been identified otherwise
  • Clusters at each level must be validated before proceeding to the next level
  • The user can partition the clusters and label the sub-concepts if the newly generated classes are useless or meaningless

21
Example
(Figure) Instantiated frames for the verbs "drive" and "travel", with subjects (father, neighbor, mother, passenger), prepositions (by, using), and objects (train, motorbike, car, bicycle). Factorizing the frames yields the head-word sets {car, train, motorbike} and {car, bicycle}; clustering then aggregates them into {car, train, motorbike, bicycle}, labelled "Motorized vehicle".
22
Methods proposed to establish relations-2
  • Generalized association rules [10]
  • A set of transactions is defined
  • Each transaction consists of a set of items, where each item comes from a set of concepts
  • Two factors are considered when estimating how relevant two different concepts Xk and Yk are in an association rule (see the sketch after this list)
  • Support: the percentage of transactions that contain both Xk and Yk as a subset
  • Confidence: the percentage of transactions containing Xk in which Yk is also seen
  • Some changes have been applied to the basic association-rule algorithm to make it suitable for finding associations at the right level of the taxonomy
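A minimal sketch of support and confidence for a candidate rule Xk -> Yk over a toy set of transactions; the transactions and function name are illustrative only.

    def support_confidence(transactions, x, y):
        """transactions: list of sets of concepts.
        Returns (support, confidence) of the rule x -> y."""
        n = len(transactions)
        with_x = [t for t in transactions if x in t]
        with_xy = [t for t in with_x if y in t]
        support = len(with_xy) / n                                   # both x and y
        confidence = len(with_xy) / len(with_x) if with_x else 0.0   # y given x
        return support, confidence

    # transactions = [{"hotel", "beach"}, {"hotel", "airport"}, {"beach"}]
    # support_confidence(transactions, "hotel", "beach") == (1/3, 1/2)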

23
Methods proposed to establish relations-3
  • Fuzzy Formal Concept Analysis (FFCA) [3]
  • FCA is based on lattice theory and is used for conceptual knowledge discovery
  • The hierarchical relationship between concepts is organized as a lattice rather than a tree
  • The method uses a citation database to generate concepts
  • The steps in generating an ontology with this method are:
  • FFCA
  • Concept clustering
  • Ontology generation

24
Current Status
  • Results
  • The proposed methods have reduced the amount of effort required from a human engineer
  • Problems to be solved
  • They all consider a single-layer generalization; in many cases, however, a multi-layer generalization would result in a better hierarchy
  • The human still plays a key role in designing the ontology, and the quality of the design depends on that work

25
Pruning the generated hierarchy
  • The generated ontology contains concepts that are not interesting and should be removed.
  • Methods used to remove uninteresting nodes:
  • Using a rule-based method with the following conditions [6]:
  • Nodes without a domain node are removed
  • Intermediate nodes with the following properties are removed:
  • The node has no siblings
  • It is not the root of any concept
  • Conditions that already hold in the ontology
  • Using IR techniques [12] (see the sketch after this list):
  • Considering term frequencies: the frequency of the current term is compared with its frequency in a generic corpus, and the term is removed if its frequency in the domain is lower than its frequency in the generic corpus
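A minimal sketch of this IR-style pruning: compare a term's relative frequency in the domain corpus with its relative frequency in a generic corpus and drop domain-rare terms; the corpora, counts, and threshold handling are illustrative assumptions.

    def prune_terms(domain_counts, domain_total, generic_counts, generic_total):
        """Keep a term only if it is relatively more frequent in the domain
        corpus than in the generic corpus."""
        kept = []
        for term, count in domain_counts.items():
            domain_freq = count / domain_total
            generic_freq = generic_counts.get(term, 0) / generic_total
            if domain_freq >= generic_freq:   # domain-specific enough: keep it
                kept.append(term)
        return kept

    # domain_counts  = {"ontology": 50, "the": 500};    domain_total  = 10_000
    # generic_counts = {"ontology": 2,  "the": 70_000}; generic_total = 1_000_000
    # prune_terms(domain_counts, 10_000, generic_counts, 1_000_000) -> ["ontology"]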

26
Conclusion
  • Progress is being made in building ontologies whose instances are web pages rather than static texts.
  • There is no clear, well-defined way to evaluate automatically built ontologies, so such ontologies are compared with hand-crafted ones.
  • This fact also hampers the comparison between two semi-automatically built ontologies.

27
References
  • [1] Sabou, M., Wroe, C., Goble, C., and Mishne, G. Learning domain ontologies for Web service descriptions: an experiment in bioinformatics. In Proceedings of the 14th International Conference on World Wide Web (WWW '05), Chiba, Japan, May 10-14, 2005. ACM Press, New York, NY, 2005.
  • [2] van Hage, W. R., de Rijke, M., and Marx, M. Information Retrieval Support for Ontology Construction and Use. In Proceedings of the 3rd International Semantic Web Conference (ISWC 2004), pages 518-533. LNCS, Springer, 2004.
  • [3] Quan, T. T., Hui, S. C., Fong, A. C. M., and Cao, T. H. Automatic Generation of Ontology for Scholarly Semantic Web. In Proceedings of the 3rd International Semantic Web Conference (ISWC 2004), pages 726-740. LNCS, Springer, 2004.
  • [4] Sabou, M., Wroe, C., Goble, C., and Mishne, G. Learning domain ontologies for Web service descriptions: an experiment in bioinformatics. In Proceedings of the 14th International Conference on World Wide Web (WWW '05), Chiba, Japan, May 10-14, 2005. ACM Press, New York, NY, 2005.
  • [5] Berendt, B., Hotho, A., and Stumme, G. Towards semantic web mining. In I. Horrocks and J. Hendler (Eds.), The Semantic Web - ISWC 2002, Proceedings of the 1st International Semantic Web Conference, June 9-12, 2002, Sardinia, Italy, pages 264-278. LNCS, Springer, Heidelberg, Germany, 2002.

28
References
  • [6] Navigli, R. and Velardi, P. Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics, Volume 30, Issue 2, June 2004.
  • [7] Sanchez, D. and Moreno, A. Web Mining Techniques for Automatic Discovery of Medical Knowledge. In Proceedings of the 10th Conference on Artificial Intelligence in Medicine (AIME 2005), Aberdeen, UK, July 23-27, 2005.
  • [8] Noy, N. F. and McGuinness, D. L. Ontology Development 101: A Guide to Creating Your First Ontology. Knowledge Systems Laboratory, March 2001.
  • [9] Kavalec, M. and Svatek, V. Information Extraction and Ontology Learning Guided by Web Directory. In ECAI Workshop on NLP and ML for Ontology Engineering, Lyon, 2002.
  • [10] Maedche, A. and Staab, S. Mining Ontologies from Text. In Proceedings of the 12th European Workshop on Knowledge Acquisition, Modeling and Management, pages 189-202. LNCS, vol. 1937, Springer, London, 2000.
  • [11] Schmid, H. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK, 1994.

29
References
  • [12] Maedche, A. and Staab, S. Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2), March 2001.
  • [13] Faure, D. and Nédellec, C. ASIUM: Learning subcategorization frames and restrictions of selection. In the 10th European Conference on Machine Learning (ECML 98), Workshop on Text Mining, Chemnitz, Germany, April 1998.
  • [14] Hearst, M. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France, 1992.