Title: Information Retrieval from Data Bases for Decisions
1. Information Retrieval from Data Bases for Decisions
- Dr. Gábor SZUCS, Ph.D.
- Assistant professor
- BUTE, Department of Information and Knowledge Management
2. Contents
- Aims
- General steps in the procedure
- Market basket analysis
- Frequent itemsets
- Conclusion
3. Aims
- Search for hidden relationships in existing databases (DBs)
  - to help make well-grounded decisions
- Data mining techniques are able to find such relationships.
  - They provide the ability to optimize decision-making.
  - They are among the most powerful tools for retrieving important information.
4. Steps of data mining
- Declaration of the key and predictor variables to be analysed (sampling from a large amount of data)
- Modification of variables: we should examine whether some variables should be integrated (large DBs always contain some errors; some transformations should be executed)
5. Additional steps of data mining
- Modelling with data mining techniques: neural networks, decision trees, regression procedures, cluster analysis, factor analysis, discriminant analysis, etc.
- Comparison of the data mining models built on the same DB (the best model can be selected).
The procedure can be repeated cyclically. After the whole procedure, the hidden relationships between different aspects can be shown.
6. Market Basket Analysis
- It is used for finding groups of items that tend to occur together.
- The models give the likelihood of different products being purchased together.
- Market basket analysis is useful for finding
  - items that occur together
  - items that occur in a particular sequence
7. Table of Co-Occurrence of Products
           Product 1  Product 2  Product 3  Product 4  Product 5
Product 1        234         12          0        125         54
Product 2         12        175         65         23         75
Product 3          0         65        229         67         62
Product 4        125         23         67        315         55
Product 5         54         75         62         55        292
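A table of this kind can be counted directly from the baskets. The following is a minimal Python sketch, assuming that each diagonal cell holds the number of baskets containing the product and each off-diagonal cell the number of baskets containing both products. The function name `co_occurrence` and the sample baskets are hypothetical and invented for illustration; they are not the data behind the table.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(transactions):
    """Count baskets per product (diagonal) and per product pair (off-diagonal)."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))            # ignore repeated items in a basket
        for item in items:
            counts[(item, item)] += 1          # diagonal cell
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1                # the table is symmetric,
            counts[(b, a)] += 1                # so fill both cells
    return counts


# Hypothetical baskets, for illustration only.
baskets = [
    ["Product 1", "Product 4"],
    ["Product 2", "Product 3", "Product 5"],
    ["Product 1", "Product 4", "Product 5"],
    ["Product 3"],
]
table = co_occurrence(baskets)
print(table[("Product 1", "Product 4")])
```

Pairs that never occur together are simply absent from the counter and read as 0, which matches the zero cells of a co-occurrence table.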
8. Procedure of market basket analysis
- Choose the right level of the product hierarchy for the items.
- Calculate the probabilities and joint probabilities of the items.
- Determine the association rules.
9. Example
Bicycle (A) 140
Hand tools for bicycle (B) 100
Tool rack (C) 61
Bicycle and hand tool (A B) 50
Bicycle and tool rack (A C) 7
Hand tool and tool rack (B C) 45
Bicycle and hand tool and tool rack (A B C) 5
10. Table of probabilities and joint probabilities of items
A         14%
B         10%
C         6.1%
A B       5%
A C       0.7%
B C       4.5%
A B C     0.5%
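The values in this table follow from the counts of the example if the total number of transactions is 1000. That total is an assumption inferred from the figures (140 bicycle purchases giving 14%); the slides do not state it. A minimal Python sketch of the calculation, with hypothetical names:

```python
# Counts taken from the example slide.
counts = {
    frozenset("A"): 140,    # bicycle
    frozenset("B"): 100,    # hand tools for bicycle
    frozenset("C"): 61,     # tool rack
    frozenset("AB"): 50,
    frozenset("AC"): 7,
    frozenset("BC"): 45,
    frozenset("ABC"): 5,
}

# Assumption: 1000 transactions in total (inferred, not stated in the source).
TOTAL_TRANSACTIONS = 1000


def probabilities(counts, total):
    """Turn absolute counts into relative frequencies."""
    return {items: n / total for items, n in counts.items()}


probs = probabilities(counts, TOTAL_TRANSACTIONS)
for items, p in probs.items():
    print(" ".join(sorted(items)), f"{p:.1%}")
```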
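11. Association rules
- A rule (A → B) consists of two parts:
  - a condition and
  - a consequence
- A confidence can be defined for each rule: the probability of the consequence given the condition, P(condition and consequence) / P(condition).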
12. Example
- P(A → B) = 5% / 14% = 0.357
- P((A B) → C) = 0.5% / 5% = 0.1
- P((A C) → B) = 0.5% / 0.7% = 0.714
- P((B C) → A) = 0.5% / 4.5% = 0.111
- Can this association rule help us?
  - If we offer product A to everybody, then 14% of the people will purchase it.
  - If we offer A only to those who bought B and C, then 11% of those people will purchase it.
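The four values in this example can be reproduced by dividing the joint probability of all items in a rule by the probability of its condition. A minimal Python sketch under that reading; the dictionary `P` restates the probabilities of the example as fractions, and the helper names are hypothetical:

```python
# Probabilities of the example, written as fractions (14% -> 0.14).
P = {
    "A": 0.14, "B": 0.10, "C": 0.061,
    "AB": 0.05, "AC": 0.007, "BC": 0.045,
    "ABC": 0.005,
}


def key(items):
    """Canonical dictionary key for a group of items, e.g. {'B', 'A'} -> 'AB'."""
    return "".join(sorted(items))


def confidence(condition, consequence, probs):
    """P(condition and consequence) / P(condition)."""
    both = key(set(condition) | set(consequence))
    return probs[both] / probs[key(condition)]


for condition, consequence in [("A", "B"), ("AB", "C"), ("AC", "B"), ("BC", "A")]:
    value = confidence(condition, consequence, P)
    print(f"({condition}) -> {consequence}: {value:.3f}")
```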
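13. Improvement
- Improvement compares the confidence of a rule with the probability of its consequence: improvement = confidence / P(consequence).
- This helps us decide whether the association rule is useful or not.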
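14. In our example
- Improvement((B C) → A) = 0.111 / 0.14 = 0.794
- Improvement((A B) → C) = 0.1 / 0.061 = 1.639
- The value of improvement shows the usefulness of the rule:
  - improvement > 1: the rule predicts the consequence better than offering it to everybody
  - improvement < 1: the rule predicts the consequence worse than offering it to everybody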
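The two improvement values above are consistent with dividing the confidence of a rule by the probability of its consequence. A minimal Python sketch under that assumption; the dictionary `P` restates the example probabilities as fractions, and the function names are hypothetical:

```python
# Probabilities of the example, written as fractions (14% -> 0.14).
P = {
    "A": 0.14, "B": 0.10, "C": 0.061,
    "AB": 0.05, "AC": 0.007, "BC": 0.045,
    "ABC": 0.005,
}


def key(items):
    """Canonical dictionary key for a group of items, e.g. {'B', 'A'} -> 'AB'."""
    return "".join(sorted(items))


def improvement(condition, consequence, probs):
    """Confidence of the rule divided by the probability of the consequence."""
    both = key(set(condition) | set(consequence))
    rule_confidence = probs[both] / probs[key(condition)]
    return rule_confidence / probs[key(consequence)]


for condition, consequence in [("BC", "A"), ("AB", "C")]:
    value = improvement(condition, consequence, P)
    verdict = "useful" if value > 1 else "not useful"
    print(f"({condition}) -> {consequence}: {value:.3f} ({verdict})")
```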
15. Dissociation rules
- Similar to association rules
- Take the inverse of each original item into account (e.g. ¬A for item A)
- Modify each transaction:
  - a transaction includes an inverse item if, and only if, it does not contain the original item.
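The transaction modification described here can be sketched in a few lines of Python. The function name `add_inverse_items` and the `"not X"` naming of inverse items are hypothetical choices made for this illustration:

```python
def add_inverse_items(transactions, items):
    """Extend each transaction with the inverse of every item it does not contain."""
    extended = []
    for basket in transactions:
        present = set(basket)
        # An inverse item is added if, and only if, the original item is missing.
        inverses = {f"not {item}" for item in items if item not in present}
        extended.append(present | inverses)
    return extended


# Hypothetical transactions, for illustration only.
modified = add_inverse_items([["A", "B"], ["C"]], ["A", "B", "C"])
print(modified)
```

After this step the usual association rule search can run on the modified transactions, so rules such as (A, not C) → B become expressible. Every transaction now has as many entries as there are items, which makes the search more expensive.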
16. Time series
- The transactions must have two additional features:
  - time information (e.g. time sequence or time stamp)
  - identifying information (e.g. customer ID, account number in a bank)
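With these two features, the transactions can be grouped by the identifier and ordered by time before looking for sequences. A minimal Python sketch; the record layout `(customer_id, timestamp, item)`, the function name and the sample records are assumptions made for illustration:

```python
from collections import defaultdict


def sequences_by_customer(records):
    """Group (customer_id, timestamp, item) records into time-ordered item lists."""
    grouped = defaultdict(list)
    for customer_id, timestamp, item in records:
        grouped[customer_id].append((timestamp, item))
    # Sorting the (timestamp, item) tuples orders each customer's purchases by time.
    return {cid: [item for _, item in sorted(events)] for cid, events in grouped.items()}


# Hypothetical records; ISO dates sort correctly as plain strings.
records = [
    ("c1", "2004-01-02", "tool rack"),
    ("c2", "2004-01-01", "bicycle"),
    ("c1", "2004-01-01", "bicycle"),
]
sequences = sequences_by_customer(records)
print(sequences)
```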
17. Frequent itemsets
- Itemsets that appear in at least a fixed ratio of the transactions
- Problem
- A-priori trick:
  - If a set of items S is frequent, then every subset of S is also frequent.
  - The procedure is built from the lower level to the upper level (frequent items, frequent pairs, etc.)
18. A-Priori Algorithm
- Define a threshold for relative frequency. All items are examined; the set of the frequent items is L1.
- Pairs of items in L1 become the candidate pairs (C2).
- Their frequency is compared with the threshold. L2 contains the frequent pairs.
19. A-Priori Algorithm (cont.)
- The candidate triples (C3) are those sets {A, B, C} such that all of their two-element subsets are in L2. L3 will contain the frequent triples.
- Li is the set of frequent itemsets of size i; Ci+1 is the candidate set of size i+1.
- Continue until the sets become empty.
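The level-by-level procedure of the last two slides can be sketched as follows. This is a simple illustrative Python version that recounts every candidate with a full pass over the baskets; it is not an optimised implementation, and the function name, parameter names and sample baskets are hypothetical:

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: relative frequency} for all itemsets reaching min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def frequent(candidates):
        """Keep the candidates whose relative frequency reaches the threshold."""
        result = {}
        for candidate in candidates:
            support = sum(1 for b in baskets if candidate <= b) / n
            if support >= min_support:
                result[candidate] = support
        return result

    # L1: the frequent single items.
    items = {item for b in baskets for item in b}
    level = frequent({frozenset([item]) for item in items})
    all_frequent = dict(level)
    size = 1
    while level:                                   # stop when a level is empty
        size += 1
        # C(size): sets of this size whose (size - 1)-subsets are all frequent.
        candidates = {
            a | b
            for a in level
            for b in level
            if len(a | b) == size
            and all(frozenset(s) in level for s in combinations(a | b, size - 1))
        }
        level = frequent(candidates)               # L(size)
        all_frequent.update(level)
    return all_frequent


# Hypothetical baskets, for illustration only.
baskets = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
result = apriori(baskets, min_support=0.5)
print(sorted("".join(sorted(s)) for s in result))
```

In this sample the triple {A, B, C} becomes a candidate because all three pairs are frequent, but it is discarded because it occurs in only two of the five baskets.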
20. Criticism of A-Priori Algorithm
- Good if we would like to know only the frequent pairs
- When searching for maximal frequent itemsets, too many steps may be needed
- The physical capacity of computers can be a limit
21. Market Basket Mining with High Correlation Analysis
- The data are organised in a matrix.
- The cells contain Boolean values:
  - 1 = yes
  - 0 = no
- This matrix is very sparse.
- We want to find the highly correlated pairs.
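A minimal Python sketch of such a search follows. The slide does not say how correlation is measured, so the sketch assumes Jaccard similarity between columns (rows where both columns are 1, divided by rows where at least one is 1). It compares every pair of columns, which is only practical for small matrices. The function name and the sample matrix are hypothetical:

```python
from itertools import combinations


def similar_column_pairs(matrix, threshold):
    """Return (i, j, similarity) for column pairs with Jaccard similarity >= threshold."""
    n_cols = len(matrix[0])
    # For a sparse matrix it is enough to store the rows that hold a 1 per column.
    rows_with_one = [
        {r for r, row in enumerate(matrix) if row[c] == 1} for c in range(n_cols)
    ]
    pairs = []
    for i, j in combinations(range(n_cols), 2):
        union = rows_with_one[i] | rows_with_one[j]
        if not union:
            continue                               # both columns are all zeros
        similarity = len(rows_with_one[i] & rows_with_one[j]) / len(union)
        if similarity >= threshold:
            pairs.append((i, j, similarity))
    return pairs


# Hypothetical Boolean matrix, for illustration only.
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]
pairs = similar_column_pairs(matrix, threshold=0.7)
print(pairs)
```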
22. Applications of High Correlation Mining
- Rows are documents, columns are words. The highly correlated pairs of columns give the words that almost always appear together.
- Rows and columns are Web pages. The cell contains 1 if the page of the row links to the page of the column. Result: pages about the same topic.
- Rows and columns are Web pages, but the cell contains 1 if the page of the column links to the page of the row. Result: mirror pages.
23. Conclusion
- Planning store layout
- Bundling products
- Offering coupons
24. Future
- Further development:
  - hierarchical association rules
  - association rules maintenance
  - sequential pattern mining
  - functional dependency mining
25. Thank you!
- The floor is open for discussion.
- E-mail: szucs_at_itm.bme.hu