Discovering synonyms

1
Discovering synonyms
  • Discovering synonyms using frequent
    itemsets and association rules

Author: Adam Jakubowski
Promoter: Prof. Henryk Rybinski
2
Presentation plan
  • Definition of problem
  • Basic terminology
  • Synonyms extraction process
  • Experiments results
  • Conclusions

3
Problem definition
  • Find synonyms, or words that are likely to be
    synonyms, in a text corpus written in natural
    language.
  • The method should be simple and involve at
    most shallow parsing of documents.

4
Basic terminology
  • Definition of synonymy
  • Database used for data mining
  • Frequent itemsets and support
  • Association rules and confidence

5
Definition of synonyms
  • Synonyms are different words with identical or
    similar meanings that are interchangeable.
  • Synonyms can be nouns, adverbs or adjectives, as
    long as both members of the pair are the same
    part of speech.
  • Source: Wikipedia

6
Assumption
  • We assume that words that appear in common
    contexts, but rarely co-occur with each other,
    may be synonyms.
  • In other words, synonyms are often used with the
    same words.

7
Definition of the data set used for data mining
  • Let I = {i1, i2, ..., im} denote a set of
    m distinct items.
  • A dataset D is then a set of transactions (or
    records), where each transaction is a subset
    of I.

8
Frequent itemsets and support
  • Itemset X is any subset of I.
  • Support of an itemset X, denoted by sup(X), is
    the number of transactions in D that contain all
    items in X.
  • An itemset X is frequent iff sup(X) > minSup,
    where minSup is a user-defined threshold
    value.
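
The definitions of support and frequency can be illustrated with a minimal Python sketch. The toy dataset D and the function names are invented for illustration and are not taken from the presentation.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], dataset: List[Transaction]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in dataset if items <= t)


def is_frequent(itemset: Iterable[str], dataset: List[Transaction], min_sup: int) -> bool:
    """Strict comparison, following the definition sup(X) > minSup."""
    return support(itemset, dataset) > min_sup


# Toy dataset (assumption): each transaction is a subset of the item set I.
D = [
    frozenset({"bank", "raise", "rate"}),
    frozenset({"bank", "cut", "rate"}),
    frozenset({"analyst", "expect", "rate"}),
    frozenset({"economist", "expect", "rate"}),
]

print(support({"rate"}, D))              # 4
print(support({"bank", "rate"}, D))      # 2
print(is_frequent({"bank", "rate"}, D, min_sup=1))  # True
```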

9
Association rules and confidence
  • An association rule is denoted X -> Y, where
    X and Y are itemsets.
  • An association rule states that if a transaction
    contains the items of X, it is likely to contain
    the items of Y as well, provided the confidence
    of the rule is high.
  • Association rule confidence:
    conf(X -> Y) = sup(X ∪ Y) / sup(X)
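
A short Python sketch of rule confidence, assuming the usual definition (the support of X ∪ Y divided by the support of X). The toy dataset and function names are illustrative assumptions.

```python
def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in dataset if items <= t)


def confidence(x, y, dataset):
    """conf(X -> Y) = sup(X ∪ Y) / sup(X); returns 0.0 if X never occurs."""
    sup_x = support(x, dataset)
    if sup_x == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), dataset) / sup_x


# Toy dataset (assumption, not from the presentation).
D = [
    frozenset({"bank", "raise", "rate"}),
    frozenset({"bank", "cut", "rate"}),
    frozenset({"analyst", "expect", "rate"}),
    frozenset({"economist", "expect", "rate"}),
]

print(confidence({"bank"}, {"rate"}, D))  # 1.0
print(confidence({"rate"}, {"bank"}, D))  # 0.5
```

The two calls show that confidence is not symmetric: every transaction with "bank" also contains "rate", but only half of the transactions with "rate" contain "bank".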

10
Synonyms extraction process
  1. Text corpus pre-processing
  2. Converting text corpus into database containing
    transactions
  3. Finding frequent itemsets
  4. Generating word contexts
  5. Computing the synonymy measure

11
Text corpus pre-processing
  • Documents are divided into paragraphs or
    sentences
  • Stopwords are removed
  • Words are tagged with parts of speech
  • Words are filtered by part of speech

12
Converting text corpus into transactional database
  • Words used in corpus are translated into items in
    database
  • Words whose support is less than minSup are
    removed
  • Processing units (sentences/paragraphs) are
    converted into transactions in database
  • Conversion allows further use of data mining
    algorithms on text corpus
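
The conversion can be sketched in Python as follows. This is a simplified illustration under stated assumptions: sentences are the processing units, the stopword list is a tiny made-up one, and part-of-speech tagging and filtering are omitted. The function name to_transactions is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real list would be longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def to_transactions(text, min_sup):
    """Turn raw text into a list of transactions (sets of word items)."""
    # Split the text into sentences, the processing units used here.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Each sentence becomes a set of lowercase words without stopwords.
    raw = [
        {w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS}
        for s in sentences
    ]
    # Word support = number of sentences that contain the word.
    word_sup = Counter(w for t in raw for w in t)
    # Words whose support is less than min_sup are removed.
    keep = {w for w, count in word_sup.items() if count >= min_sup}
    return [frozenset(t & keep) for t in raw if t & keep]


text = "The bank raised the rate. The bank cut the rate. Analysts expect a cut."
for transaction in to_transactions(text, min_sup=2):
    print(sorted(transaction))
```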

13
Finding frequent itemsets
  • Frequent itemsets are mined with an Apriori-like
    algorithm.
  • It uses modified data structures that are more
    efficient than the hash tree applied by Agrawal.
  • Results are stored in a specially adjusted Lucene
    index.
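
For orientation, a plain Apriori sketch in Python is given below. It is a textbook-style version built on Python sets. It does not reproduce the modified data structures or the Lucene storage mentioned on the slide, and the toy dataset is an assumption.

```python
from itertools import combinations


def apriori(dataset, min_sup):
    """Return {itemset: support} for all itemsets with support > min_sup."""

    def sup(items):
        return sum(1 for t in dataset if items <= t)

    def frequent_only(candidates):
        result = {}
        for c in candidates:
            s = sup(c)
            if s > min_sup:
                result[c] = s
        return result

    # Level 1: frequent single items.
    all_items = {i for t in dataset for i in t}
    level = frequent_only({frozenset([i]) for i in all_items})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent_only(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Toy dataset (assumption, not from the presentation).
D = [
    frozenset({"bank", "raise", "rate"}),
    frozenset({"bank", "cut", "rate"}),
    frozenset({"analyst", "expect", "rate"}),
    frozenset({"economist", "expect", "rate"}),
]

for itemset, s in sorted(apriori(D, min_sup=1).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), s)
```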

14
Generating word contexts
  • For each word, its context is computed.
  • The context of a word x is the set of all
    frequent itemsets that contain x (supersets of
    {x}); we refer to it as context(x).
  • Formally:
    context(x) = { X : X is a frequent itemset and x ∈ X }
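
A minimal Python sketch of the context of a word, reading the definition literally as "every frequent itemset that contains the word" (so the singleton itemset is included). The set of frequent itemsets is hand-written for illustration and is not taken from the experiments.

```python
def context(word, frequent_itemsets):
    """All frequent itemsets that contain `word`."""
    return {s for s in frequent_itemsets if word in s}


# Hand-written frequent itemsets (assumption).
F = {
    frozenset({"analyst"}),
    frozenset({"economist"}),
    frozenset({"expect"}),
    frozenset({"analyst", "expect"}),
    frozenset({"economist", "expect"}),
}

for itemset in sorted(context("analyst", F), key=len):
    print(sorted(itemset))
```

In this toy example "analyst" and "economist" never occur together in a frequent itemset, yet both occur with "expect", which is the kind of shared context the method looks for.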

15
Computing synonymy measure 1/3
  • Each pair of frequent words X, Y that are assigned
    the same part of speech is checked for
    synonymy.
  • If the itemset {X, Y} is frequent, the pair X, Y
    is no longer suspected of being synonyms.
  • The synonymy measure consists of three measures:
    SM1, SM2 and SM3.
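
The candidate selection described on this slide can be sketched in Python as follows. The part-of-speech dictionary, the tag names and the function name candidate_pairs are hypothetical. The sketch covers only the filtering step, not the measures SM1 to SM3.

```python
from itertools import combinations


def candidate_pairs(pos_of, frequent_itemsets):
    """Pairs of frequent words with the same part of speech that do not
    form a frequent itemset together.

    pos_of: dict mapping each frequent word to its part-of-speech tag
            (hypothetical input format).
    """
    pairs = []
    for x, y in combinations(sorted(pos_of), 2):
        if pos_of[x] != pos_of[y]:
            continue  # different parts of speech
        if frozenset({x, y}) in frequent_itemsets:
            continue  # the words co-occur frequently, so the pair is dropped
        pairs.append((x, y))
    return pairs


# Toy inputs (assumptions, not from the experiments).
pos_of = {"analyst": "NN", "economist": "NN", "expect": "VB", "rate": "NN"}
frequent = {frozenset({"analyst", "rate"}), frozenset({"economist", "rate"})}

print(candidate_pairs(pos_of, frequent))  # [('analyst', 'economist')]
```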

16
Computing synonymy measure 2/3
  • SM1 is the number of rules with predecessor X
    or Y that share a common consequent.
  • SM2 is computed by dividing SM1 by the size of
    the larger of the contexts of X and Y.

17
Computing synonymy measure 3/3
  • SM3 for words X and Y is also computed from
    association rules, but with contexts used in
    the predecessors.
  • This measure attempts to check whether X and Y,
    each extended with a context A, again have
    similar contexts.

18
Experiments results for Reuters 1/3
  • Selected Reuters documents
  • Number of documents: 200
  • Number of sentences: 21198
  • Minimum support: 0.064

19
Experiments results for Reuters 2/3
Word 1      Word 2      SM3                 SM2                 SM1
technology  problem     0.8                 0.9276509087484697  4
chairman    technology  0.8                 0.9183573114532989  4
chairman    problem     0.8                 0.8690496948561465  4
industry    time        0.75                0.9269811007086963  21
spokesman   official    0.7391304347826086  0.9254639056002771  17
Korea       Africa      0.75                0.8651219394640447  3
analyst     economist   0.6666666666666666  0.9663239620791109  4
20
Experiments results for Reuters 3/3
Word 1      Word 2      SM3                 SM2                 SM1
economy     investment  0.7647058823529411  0.8405541988875322  13
country     world       0.6666666666666666  0.9462921022067363  16
analyst     chairman    0.8                 0.7878777589134127  4
analyst     technology  0.8                 0.7819081239684578  4
companies   economy     0.65                0.9541963436395156  13
companies   production  0.6190476190476191  0.9895503992995227  13
economy     country     0.625               0.9544173505107411  15
industry    business    0.6206896551724138  0.958468762215417   18
21
Experiments results for Ontology documents 1/3
  • Scientific articles about ontologies
  • Number of documents: 87
  • Number of sentences: 52941
  • Minimum support: 0.1

22
Experiments results for Ontology documents 2/3
Word 1           Word 2      SM3  SM2                 SM1
members          PMS         1.0  0.991888549396188   3
representations  projects    1.0  0.9906576141847154  4
members          advantage   1.0  0.9881207478329057  3
advantage        PMS         1.0  0.9800092972290937  3
comparison       effect      1.0  0.9789939704695438  3
details          effect      1.0  0.9771241830065359  3
others           Problem     1.0  0.9715776584936846  5
23
Experiments results for Ontology documents 3/3
Word 1      Word 2      SM3  SM2                 SM1
space       strategies  1.0  0.971282011602115   5
members     effect      1.0  0.9687267311988086  3
comparison  details     1.0  0.9653618509550712  3
version     usage       1.0  0.9638703031289708  4
selection   output      1.0  0.9636859939759037  4
effect      advantage   1.0  0.9623264134734298  3
functions   projects    1.0  0.9622140962455208  4
24
Conclusions 1/2
  • Evaluating the results of our experiments is
    problematic.
  • The method finds more than just synonyms.

25
Conclusions 2/2
  • If two terms that do not co-occur have
    similar contexts, one of the following cases
    may hold:
  • The terms are synonyms
  • The terms are related by a broader/narrower
    relation (trade, business)
  • One term is an instance of a category defined
    by the other (country, Japan)
  • Both terms are instances of the same category
    (Tuesday, Wednesday)
  • The terms are associated (implementation, task)
  • The terms are not related, or the relation is
    hard to determine (industry, time)