Title: Discovering synonyms
1Discovering synonyms
- Discovering synonyms using frequent itemsets and association rules
Author: Adam Jakubowski
Promoter: prof. Henryk Rybinski
2Presentation plan
- Definition of problem
- Basic terminology
- Synonyms extraction process
- Experiments results
- Conclusions
3Problem definition
- Find synonyms, or words that are likely to be synonyms, in a text corpus written in natural language.
- The method should be simple and require at most shallow parsing of the documents.
4Basic terminology
- Definition of synonymy
- Database used for data mining
- Frequent itemsets and support
- Association rules and confidence
5Definition of synonyms
- Synonyms are different words with similar or identical meanings that are interchangeable.
- Synonyms can be nouns, adverbs or adjectives, as long as both members of the pair are the same part of speech. (Source: Wikipedia)
6Assumption
- We assume that words that appear in the same contexts in text, but do not commonly co-occur, can be synonyms.
- In other words, synonyms are often used with the same words.
7Definition of data set used for DM
- Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of distinct items.
- A dataset D is then a set of transactions (or records), where each transaction is a subset of I.
8Frequent itemsets and support
- An itemset X is any subset of I.
- The support of an itemset X, denoted by sup(X), is the number of transactions in D that contain all items in X.
- An itemset X is frequent iff $\mathrm{sup}(X) > \mathit{minSup}$, where minSup is a user-defined threshold value.
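The definitions above can be illustrated with a minimal Python sketch. The function names and the toy dataset are assumptions made for illustration only; they do not come from the presentation.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

def support(itemset: Iterable[str], dataset: List[Transaction]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in dataset if items <= t)

def is_frequent(itemset: Iterable[str], dataset: List[Transaction], min_sup: int) -> bool:
    """An itemset is frequent iff sup(X) > minSup (strict inequality, as defined above)."""
    return support(itemset, dataset) > min_sup

# Toy dataset D (invented): each transaction is a subset of the item set I.
D = [
    frozenset({"bank", "raise", "rate"}),
    frozenset({"bank", "cut", "rate"}),
    frozenset({"firm", "raise", "price"}),
    frozenset({"bank", "rate"}),
]

print(support({"bank", "rate"}, D))         # 3
print(is_frequent({"bank", "rate"}, D, 2))  # True
print(is_frequent({"raise"}, D, 2))         # False: support is 2, which is not > 2
```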
9Association rules and confidence
- An association rule is denoted as $X \rightarrow Y$, where X and Y are itemsets.
- An association rule states that if a transaction contains the items of X, then it is likely to also contain the items of Y, provided the confidence of the rule is high.
- Association rule confidence: $\mathrm{conf}(X \rightarrow Y) = \mathrm{sup}(X \cup Y) / \mathrm{sup}(X)$
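As a small illustration, the sketch below computes confidence in its standard form, sup(X ∪ Y) / sup(X). The helper names and the toy transactions are assumptions for this example, not part of the presentation.

```python
def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in dataset if items <= t)

def confidence(x, y, dataset):
    """conf(X -> Y) = sup(X u Y) / sup(X); returns 0.0 when X never occurs."""
    sup_x = support(x, dataset)
    if sup_x == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), dataset) / sup_x

# Toy dataset (invented for illustration).
D = [
    frozenset({"bank", "raise", "rate"}),
    frozenset({"bank", "cut", "rate"}),
    frozenset({"firm", "raise", "price"}),
    frozenset({"bank", "rate"}),
]

print(confidence({"bank"}, {"rate"}, D))   # 1.0: every transaction with "bank" also has "rate"
print(confidence({"raise"}, {"rate"}, D))  # 0.5: one of the two "raise" transactions has "rate"
```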
10Synonyms extraction process
- Text corpus pre-processing
- Converting the text corpus into a database of transactions
- Finding frequent itemsets
- Generating word contexts
- Computing the synonymy measure
11Text corpora pre-processing
- Dividing documents into paragraphs or sentences
- Removing stopwords
- Tagging words with parts of speech
- Filtering by part of speech
12Converting text corpus into transactional database
- Words used in the corpus are translated into items in the database.
- Words whose support is less than minSup are removed.
- Processing units (sentences or paragraphs) are converted into transactions in the database.
- The conversion allows data mining algorithms to be applied to the text corpus.
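The conversion steps above can be sketched roughly as follows. This is a simplified illustration, not the pipeline used in the experiments: it splits on sentence punctuation with a regular expression, uses a tiny invented stopword list, and omits part-of-speech tagging entirely. The name `to_transactions` is hypothetical.

```python
import re
from collections import Counter

# Tiny stopword list, invented for illustration; a real run would use a full list.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def to_transactions(text, min_sup):
    """Turn raw text into transactions: one set of words per sentence."""
    # Processing unit: a sentence (naive split on ., ! and ?).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    transactions = [
        frozenset(w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS)
        for s in sentences
    ]
    # Word support = number of transactions that contain the word.
    word_support = Counter(w for t in transactions for w in t)
    # Words whose support is less than min_sup are removed.
    kept = {w for w, count in word_support.items() if count >= min_sup}
    return [t & kept for t in transactions if t & kept]

text = "The bank raised the rate. The bank cut the rate. A firm raised prices."
for transaction in to_transactions(text, min_sup=2):
    print(sorted(transaction))
```

With `min_sup=2` the words "cut", "firm" and "prices" are dropped, because each appears in only one sentence.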
13Finding frequent itemsets
- Frequent itemsets are mined with an Apriori-like algorithm.
- It uses modified data structures that are more efficient than the hash tree applied by Agrawal.
- Results are stored in a specially adjusted Lucene index.
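For orientation, a plain level-wise Apriori can be sketched as below. This is the textbook scheme written with Python sets; it does not reproduce the modified data structures or the Lucene storage mentioned above, and the function name and toy data are assumptions.

```python
from itertools import combinations

def apriori(dataset, min_sup):
    """Return {itemset: support} for all itemsets with support > min_sup."""
    def sup(candidate):
        return sum(1 for t in dataset if candidate <= t)

    # Level 1: frequent single items.
    items = {i for t in dataset for i in t}
    level = {frozenset([i]) for i in items}
    level = {c for c in level if sup(c) > min_sup}

    frequent = {}
    k = 1
    while level:
        frequent.update({c: sup(c) for c in level})
        k += 1
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c for c in candidates if sup(c) > min_sup}
    return frequent

# Toy dataset (invented for illustration).
D = [
    frozenset({"bank", "raise", "rate"}),
    frozenset({"bank", "cut", "rate"}),
    frozenset({"firm", "raise", "price"}),
    frozenset({"bank", "rate"}),
]

for itemset, count in apriori(D, min_sup=1).items():
    print(sorted(itemset), count)
```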
14Generation of words context
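- For each word, its context is computed.
- The context of a word x is the set of all frequent itemsets that contain x (supersets of {x}); we refer to it as context(x).
- Formally: $\mathrm{context}(x) = \{ X \subseteq I : \mathrm{sup}(X) > \mathit{minSup},\ x \in X \}$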
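Read literally, the context of a word is a filter over the mined frequent itemsets. A minimal Python sketch, assuming the frequent itemsets are already available as a collection of frozensets (the data here is invented):

```python
def context(word, frequent_itemsets):
    """All frequent itemsets that contain `word` (including {word} itself)."""
    return {itemset for itemset in frequent_itemsets if word in itemset}

# Hypothetical output of the frequent itemset mining step.
frequent = {
    frozenset({"bank"}),
    frozenset({"rate"}),
    frozenset({"bank", "rate"}),
    frozenset({"bank", "loan"}),
}

print(len(context("bank", frequent)))  # 3
print(len(context("rate", frequent)))  # 2
```

Whether the singleton {x} should count as part of the context is a design choice; this sketch follows the wording above and keeps it.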
15Computing synonymy measure 1/3
- Each pair of frequent words x, y that are assigned the same part of speech is checked for synonymy.
- If the itemset {x, y} is frequent, the pair x, y is no longer suspected of being synonyms.
- The synonymy measure consists of three measures: SM1, SM2 and SM3.
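The candidate selection described above (same part of speech, and the pair itself not frequent) can be sketched as follows. The function name, the tag labels and the data are assumptions for illustration; the computation of SM1, SM2 and SM3 is not covered here.

```python
from itertools import combinations

def candidate_pairs(frequent_itemsets, pos):
    """Pairs of frequent words with the same POS tag whose 2-itemset is NOT frequent."""
    # Frequent words are the frequent itemsets of size 1.
    words = sorted(next(iter(s)) for s in frequent_itemsets if len(s) == 1)
    return [
        (x, y)
        for x, y in combinations(words, 2)
        if pos[x] == pos[y] and frozenset((x, y)) not in frequent_itemsets
    ]

# Hypothetical mining output and part-of-speech tags.
frequent = {
    frozenset({"analyst"}),
    frozenset({"economist"}),
    frozenset({"rate"}),
    frozenset({"rise"}),
    frozenset({"analyst", "rate"}),
    frozenset({"economist", "rate"}),
}
pos = {"analyst": "NN", "economist": "NN", "rate": "NN", "rise": "VB"}

print(candidate_pairs(frequent, pos))  # [('analyst', 'economist')]
```

Here "analyst" and "rate" are dropped as a pair because they co-occur frequently, while "analyst" and "economist" remain candidates.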
16Computing synonymy measure 2/3
- SM1 is the number of rules with antecedent x or y that share a common consequent.
- SM2 is computed by dividing SM1 by the size of the larger of the contexts of the words x and y.
17Computing synonymy measure 3/3
- SM3 for words x and y is also computed from association rules, but with contexts included in the antecedents.
- This measure checks whether x and y, each extended with a context A, again have similar contexts.
18Experiments results for Reuters 1/3
- Selected Reuters documents
- Number of documents: 200
- Number of sentences: 21198
- Minimum support: 0.064
19Experiments results for Reuters 2/3
Word 1      Word 2      SM3                 SM2                 SM1
technology  problem     0.8                 0.9276509087484697  4
chairman    technology  0.8                 0.9183573114532989  4
chairman    problem     0.8                 0.8690496948561465  4
industry    time        0.75                0.9269811007086963  21
spokesman   official    0.7391304347826086  0.9254639056002771  17
Korea       Africa      0.75                0.8651219394640447  3
analyst     economist   0.6666666666666666  0.9663239620791109  4
20Experiments results for Reuters 3/3
Word 1      Word 2      SM3                 SM2                 SM1
economy     investment  0.7647058823529411  0.8405541988875322  13
country     world       0.6666666666666666  0.9462921022067363  16
analyst     chairman    0.8                 0.7878777589134127  4
analyst     technology  0.8                 0.7819081239684578  4
companies   economy     0.65                0.9541963436395156  13
companies   production  0.6190476190476191  0.9895503992995227  13
economy     country     0.625               0.9544173505107411  15
industry    business    0.6206896551724138  0.958468762215417   18
21Experiments results for Ontology documents 1/3
- Scientific articles about ontologies
- Number of documents: 87
- Number of sentences: 52941
- Minimum support: 0.1
22Experiments results for Ontology documents 2/3
Word 1           Word 2      SM3  SM2                 SM1
members          PMS         1.0  0.991888549396188   3
representations  projects    1.0  0.9906576141847154  4
members          advantage   1.0  0.9881207478329057  3
advantage        PMS         1.0  0.9800092972290937  3
comparison       effect      1.0  0.9789939704695438  3
details          effect      1.0  0.9771241830065359  3
others           Problem     1.0  0.9715776584936846  5
23Experiments results for Ontology documents 3/3
Word 1           Word 2      SM3  SM2                 SM1
space            strategies  1.0  0.971282011602115   5
members          effect      1.0  0.9687267311988086  3
comparison       details     1.0  0.9653618509550712  3
version          usage       1.0  0.9638703031289708  4
selection        output      1.0  0.9636859939759037  4
effect           advantage   1.0  0.9623264134734298  3
functions        projects    1.0  0.9622140962455208  4
24Conclusions 1/2
- Evaluating the results of our experiments is problematic.
- The method finds not only synonyms.
25Conclusions 2/2
- If two terms that do not co-occur have similar contexts, one of the following cases may hold:
- The terms are synonyms.
- The terms are related by a broader/narrower relation (trade, business).
- One term is an instance of a category defined by the other (country, Japan).
- The two terms are instances of the same category (Tuesday, Wednesday).
- The terms are associated (implementation, task).
- The terms are not related, or the relation is hard to tell (industry, time).