Discovering synonyms - PowerPoint PPT Presentation

Provided by: adamjak

1
Discovering synonyms

Discovering synonyms with use of frequent
itemsets and association rules

Author: Adam Jakubowski
Promoter: Prof. Henryk Rybinski
2
Presentation plan

Definition of problem
Basic terminology
Synonym extraction process
Experimental results
Conclusions

3
Problem definition

Find synonyms, or words that are likely to be synonyms, in a text corpus written in natural language.
The method should be simple and require at most shallow parsing of the documents.

4
Basic terminology

Definition of synonymy
Database used for data mining
Frequent itemsets and support
Association rules and confidence

5
Definition of synonyms

Synonyms are different words with similar or
identical meanings and are interchangeable.
Synonyms can be nouns, adverbs or adjectives, as
long as both members of the pair are the same
part of speech.
Source: Wikipedia

6
Assumption

We assume that words that appear in a common context in a text, but do not commonly co-occur with each other, may be synonyms.
That is, synonyms are often used with the same words.

7
Definition of data set used for DM

Let $I = \{i_1, i_2, \ldots, i_m\}$ denote a set of $m$ distinct items.
Then a dataset D is a set of transactions (or records), where each transaction is a subset of I.

8
Frequent itemsets and support

An itemset X is any subset of I.
The support of an itemset X, denoted by sup(X), is the number of transactions in D that contain all items in X.
An itemset X is called frequent iff sup(X) > minSup, where minSup is a user-defined threshold value.
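
A minimal Python sketch of these definitions may help. It assumes transactions are represented as frozensets of words; the toy dataset `D` and the function names are illustrative, not taken from the presentation.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], dataset: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t)


def is_frequent(itemset: FrozenSet[str], dataset: List[Transaction], min_sup: int) -> bool:
    """An itemset is frequent iff its support is strictly greater than min_sup."""
    return support(itemset, dataset) > min_sup


# Toy dataset (hypothetical): each transaction is a set of words.
D = [
    frozenset({"bank", "loan", "rate"}),
    frozenset({"bank", "rate"}),
    frozenset({"loan", "credit"}),
    frozenset({"bank", "credit", "rate"}),
]

print(support(frozenset({"bank", "rate"}), D))         # 3
print(is_frequent(frozenset({"bank", "rate"}), D, 2))  # True
print(is_frequent(frozenset({"loan"}), D, 2))          # False (support is 2)
```

Support is counted here as an absolute number of transactions; a relative threshold would simply divide by `len(D)`.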

9
Association rules and confidence

An association rule is denoted $X \rightarrow Y$, where X and Y are itemsets.
An association rule states that if a transaction contains the items of X, then it is likely to also contain the items of Y, provided the confidence of the rule is high.
Association rule confidence: $\mathrm{conf}(X \rightarrow Y) = \mathrm{sup}(X \cup Y) / \mathrm{sup}(X)$
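
Rule confidence can be sketched in a few lines of Python, assuming the standard definition: the support of X and Y taken together, divided by the support of X. The dataset and function names below are illustrative only.

```python
from typing import FrozenSet, List


def support(itemset: FrozenSet[str], dataset: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t)


def confidence(x: FrozenSet[str], y: FrozenSet[str], dataset: List[FrozenSet[str]]) -> float:
    """Confidence of the rule x -> y: sup(x | y) / sup(x)."""
    sup_x = support(x, dataset)
    if sup_x == 0:
        raise ValueError("x never occurs, so the confidence is undefined")
    return support(x | y, dataset) / sup_x


# Toy dataset (hypothetical).
D = [
    frozenset({"bank", "loan", "rate"}),
    frozenset({"bank", "rate"}),
    frozenset({"loan", "credit"}),
    frozenset({"bank", "credit", "rate"}),
]

print(confidence(frozenset({"bank"}), frozenset({"rate"}), D))    # 1.0
print(confidence(frozenset({"credit"}), frozenset({"bank"}), D))  # 0.5
```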

10
Synonym extraction process

Text corpus pre-processing
Converting text corpus into database containing
transactions
Finding frequent itemsets
Generation of word contexts
Computing synonymy measure

11
Text corpus pre-processing

Dividing documents into paragraphs or sentences
Removing stopwords
Tagging with parts of speech
Filtering by part of speech
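
The four pre-processing steps can be sketched as follows. This is only an illustration: the stopword list and the `TOY_POS` lexicon are hypothetical stand-ins, and a real pipeline would use a proper stopword list and a trained part-of-speech tagger.

```python
import re
from typing import FrozenSet, List

# Hypothetical stand-ins for a real stopword list and a real POS tagger.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}
TOY_POS = {
    "analyst": "NOUN", "economist": "NOUN", "market": "NOUN", "growth": "NOUN",
    "expects": "VERB", "predicts": "VERB", "strong": "ADJ",
}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def preprocess(text: str, keep_pos: FrozenSet[str] = frozenset({"NOUN"})) -> List[List[str]]:
    """Split into sentences, drop stopwords, tag, and keep only the selected parts of speech."""
    result = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        tokens = [t for t in tokens if t not in STOPWORDS]
        tagged = [(t, TOY_POS.get(t, "UNK")) for t in tokens]
        result.append([t for t, pos in tagged if pos in keep_pos])
    return result


text = "The analyst expects strong growth. The economist predicts growth in the market."
print(preprocess(text))
# [['analyst', 'growth'], ['economist', 'growth', 'market']]
```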

12
Converting text corpus into transactional database

Words used in the corpus are translated into items in the database
Words whose support is less than minSup are removed
Processing units (sentences or paragraphs) are converted into transactions in the database
The conversion allows data mining algorithms to be applied to the text corpus
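
A possible sketch of this conversion is shown below. It assumes the processing units have already been tokenized, maps each retained word to an integer item id, and drops words whose support is below the threshold. The function name and the integer-id encoding are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def to_transactions(
    units: List[List[str]], min_sup: int
) -> Tuple[Dict[str, int], List[FrozenSet[int]]]:
    """Map words to item ids and convert each processing unit into a transaction."""
    # Support of a word = number of units in which it appears at least once.
    word_support = Counter(w for unit in units for w in set(unit))
    # Words whose support is less than min_sup are removed.
    kept = sorted(w for w, s in word_support.items() if s >= min_sup)
    item_id = {w: i for i, w in enumerate(kept)}
    transactions = [frozenset(item_id[w] for w in unit if w in item_id) for unit in units]
    return item_id, transactions


units = [
    ["analyst", "growth"],
    ["economist", "growth", "market"],
    ["analyst", "market"],
]
item_id, transactions = to_transactions(units, min_sup=2)
print(item_id)       # {'analyst': 0, 'growth': 1, 'market': 2}
print(transactions)  # three transactions: {0, 1}, {1, 2}, {0, 2}
```

Here "economist" occurs in only one unit and is therefore removed before any itemsets are mined.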

13
Finding frequent itemsets

Frequent itemsets are mined with an Apriori-like algorithm.
It uses modified data structures that are more efficient than the hash tree applied by Agrawal.
Results are stored in a specially adapted Lucene index.
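
For orientation, here is a textbook-style level-wise Apriori sketch in Python. It is not the modified algorithm described on this slide: it uses plain sets and dictionaries instead of the author's data structures or a hash tree, and keeps the results in memory rather than in a Lucene index.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def apriori(transactions: List[FrozenSet[str]], min_sup: int) -> Dict[FrozenSet[str], int]:
    """Return {itemset: support} for every itemset with support > min_sup."""

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset({i}) for t in transactions for i in t}
    level = {c: s for c, s in count(items).items() if s > min_sup}
    frequent = dict(level)
    k = 2
    while level:
        prev = list(level)
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: s for c, s in count(candidates).items() if s > min_sup}
        frequent.update(level)
        k += 1
    return frequent


# Toy dataset (hypothetical).
D = [
    frozenset({"bank", "loan", "rate"}),
    frozenset({"bank", "loan", "rate"}),
    frozenset({"bank", "rate"}),
    frozenset({"loan", "credit"}),
]
frequent = apriori(D, min_sup=1)
print(len(frequent))                                   # 7
print(frequent[frozenset({"bank", "loan", "rate"})])   # 2
```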

14
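Generation of word contexts

For each word, its context is computed.
The context of a word x is the set of all frequent itemsets that contain x (supersets of {x}); we refer to it as context(x).
Formally: $\mathrm{context}(x) = \{ X \subseteq I : x \in X \text{ and } X \text{ is frequent} \}$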
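
Read literally, the context of a word is a simple filter over the mined frequent itemsets. The sketch below assumes the itemsets are available as frozensets of words; the example itemsets are made up for illustration.

```python
from typing import FrozenSet, Iterable, Set


def context(word: str, frequent_itemsets: Iterable[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """All frequent itemsets that contain `word` (including the singleton itself)."""
    return {s for s in frequent_itemsets if word in s}


# Hypothetical output of the frequent itemset mining step.
frequent = [
    frozenset({"analyst"}),
    frozenset({"economist"}),
    frozenset({"growth"}),
    frozenset({"analyst", "growth"}),
    frozenset({"economist", "growth"}),
    frozenset({"analyst", "growth", "market"}),
]

print(len(context("analyst", frequent)))  # 3
```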

15
Computing synonymy measure 1/3

Each pair of frequent words X, Y that are assigned the same part of speech is checked for synonymy.
If the itemset {X, Y} is frequent, the pair X, Y is no longer suspected of being synonyms.
The synonymy measure consists of three measures: SM1, SM2 and SM3.
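
The candidate selection described here can be sketched as a filter over word pairs. The `pos_of` mapping (word to part-of-speech tag) and the example itemsets are hypothetical inputs that would come from the earlier steps.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple


def candidate_pairs(
    pos_of: Dict[str, str], frequent_itemsets: Iterable[FrozenSet[str]]
) -> List[Tuple[str, str]]:
    """Pairs of frequent words with the same POS that are not frequent together."""
    frequent = set(frequent_itemsets)
    words = sorted(w for w in pos_of if frozenset({w}) in frequent)
    return [
        (x, y)
        for x, y in combinations(words, 2)
        if pos_of[x] == pos_of[y] and frozenset({x, y}) not in frequent
    ]


# Hypothetical inputs.
pos_of = {"analyst": "NOUN", "economist": "NOUN", "growth": "NOUN", "strong": "ADJ"}
frequent = [
    frozenset({"analyst"}), frozenset({"economist"}),
    frozenset({"growth"}), frozenset({"strong"}),
    frozenset({"analyst", "growth"}), frozenset({"economist", "growth"}),
]

print(candidate_pairs(pos_of, frequent))  # [('analyst', 'economist')]
```

The pairs with "growth" are rejected because they co-occur frequently, and the pairs with "strong" are rejected because the parts of speech differ.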

16
Computing synonymy measure 2/3

SM1 is computed as the number of rules with predecessor X or Y and a common consequent.
SM2 is computed by dividing SM1 by the size of the larger of the contexts of the words X and Y.
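
The slide does not give exact formulas, so the following is only one possible reading, stated as an assumption: each frequent itemset S containing a word w is treated as a rule from w to the remaining items of S, SM1 counts the right-hand sides that X and Y have in common, and SM2 divides that count by the size of the larger context. No confidence threshold is applied, and all names are illustrative.

```python
from typing import FrozenSet, Iterable, Set


def context(word: str, itemsets: Iterable[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """All frequent itemsets that contain `word`."""
    return {s for s in itemsets if word in s}


def rule_sides(word: str, itemsets: Iterable[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """Assumed reading: each itemset S containing `word` yields a rule word -> S - {word}."""
    return {s - {word} for s in context(word, itemsets) if len(s) > 1}


def sm1(x: str, y: str, itemsets: Iterable[FrozenSet[str]]) -> int:
    """Number of rule right-hand sides shared by x and y."""
    itemsets = list(itemsets)
    return len(rule_sides(x, itemsets) & rule_sides(y, itemsets))


def sm2(x: str, y: str, itemsets: Iterable[FrozenSet[str]]) -> float:
    """SM1 divided by the size of the larger of the two contexts."""
    itemsets = list(itemsets)
    larger = max(len(context(x, itemsets)), len(context(y, itemsets)))
    return sm1(x, y, itemsets) / larger if larger else 0.0


# Hypothetical frequent itemsets.
frequent = [
    frozenset({"analyst"}), frozenset({"economist"}), frozenset({"growth"}),
    frozenset({"market"}), frozenset({"rate"}),
    frozenset({"analyst", "growth"}), frozenset({"economist", "growth"}),
    frozenset({"analyst", "market"}), frozenset({"economist", "market"}),
    frozenset({"analyst", "rate"}),
]

print(sm1("analyst", "economist", frequent))  # 2
print(sm2("analyst", "economist", frequent))  # 0.5
```

The values reported in the experiment tables were produced by the author's own implementation and need not match this simplified reading.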

17
Computing synonymy measure 3/3

SM3 for words X, Y is also computed with the use of association rules, but with contexts included in the predecessors.
This measure attempts to check whether X and Y, each extended with a context A, again have similar contexts.

18
Experimental results for Reuters 1/3

Selected Reuters documents
Number of documents: 200
Number of sentences: 21198
Minimum support: 0.064

19
Experimental results for Reuters 2/3
word1       word2       SM3                 SM2                 SM1
technology  problem     0.8                 0.9276509087484697  4
chairman    technology  0.8                 0.9183573114532989  4
chairman    problem     0.8                 0.8690496948561465  4
industry    time        0.75                0.9269811007086963  21
spokesman   official    0.7391304347826086  0.9254639056002771  17
Korea       Africa      0.75                0.8651219394640447  3
analyst     economist   0.6666666666666666  0.9663239620791109  4
20
Experimental results for Reuters 3/3
word1      word2       SM3                 SM2                 SM1
economy    investment  0.7647058823529411  0.8405541988875322  13
country    world       0.6666666666666666  0.9462921022067363  16
analyst    chairman    0.8                 0.7878777589134127  4
analyst    technology  0.8                 0.7819081239684578  4
companies  economy     0.65                0.9541963436395156  13
companies  production  0.6190476190476191  0.9895503992995227  13
economy    country     0.625               0.9544173505107411  15
industry   business    0.6206896551724138  0.958468762215417   18
21
Experimental results for Ontology documents 1/3

Scientific articles about ontologies
Number of documents: 87
Number of sentences: 52941
Minimum support: 0.1

22
Experimental results for Ontology documents 2/3
word1            word2      SM3  SM2                 SM1
members          PMS        1.0  0.991888549396188   3
representations  projects   1.0  0.9906576141847154  4
members          advantage  1.0  0.9881207478329057  3
advantage        PMS        1.0  0.9800092972290937  3
comparison       effect     1.0  0.9789939704695438  3
details          effect     1.0  0.9771241830065359  3
others           Problem    1.0  0.9715776584936846  5
23
Experimental results for Ontology documents 3/3
word1       word2       SM3  SM2                 SM1
space       strategies  1.0  0.971282011602115   5
members     effect      1.0  0.9687267311988086  3
comparison  details     1.0  0.9653618509550712  3
version     usage       1.0  0.9638703031289708  4
selection   output      1.0  0.9636859939759037  4
effect      advantage   1.0  0.9623264134734298  3
functions   projects    1.0  0.9622140962455208  4
24
Conclusions 1/2

Evaluating the results of our experiments is problematic.
The method finds not only synonyms.

25
Conclusions 2/2

If two terms that do not co-occur have similar contexts, one of the following cases may hold:
The terms are synonyms.
The terms are related by a broader/narrower relation (trade, business).
One term is an instance of a category defined by the other (country, Japan).
The two terms are instances of the same category (Tuesday, Wednesday).
The terms are associated (implementation, task).
The terms are not related, or the relation is hard to tell (industry, time).