Title: Course on Data Mining (581550-4)
Course on Data Mining (581550-4)
Intro/Ass. Rules
Clustering
Episodes
KDD Process
Text Mining
Appl./Summary
Course on Data Mining (581550-4)
Today 26.10.2001
- Course organization
- Summary: What is data mining?
- Today's subject: Association rules
- Next week's program
- Lecture: Episodes
- Exercise: Associations
- Seminar: Associations
Course Organization
Lectures, Exercises, Exam
- 12 lectures 24.10.-30.11.2001
- Wed 14-16, Fri 12-14 (A217)
- Wed: normal lecture
- Fri: seminar-like lecture (except for 26.10.)
- 5 exercise sessions 1.11.-29.11.2001
- Thu 12-14 (A318)
- Home exam
- Given 28.11., due 21.12.2001
- Language
- The lecturing language is Finnish
- Slides and material are in English
Course Organization
Group Work
- Group for seminar (and exercise) work
- 10 groups, à 3 persons, 2 groups/lecture
- Dates are agreed at the beginning of the course
- Articles are given on the previous week's Wed
- Seminar presentations
- Presentation (around 3-5 printed pages) is due by the start of the seminar
- Can be either an HTML page or a printable document in PostScript/PDF format
- 30 minutes of presentation
- 5-15 minutes of discussion
- Active participation
Course Organization / Groups
- Group presentation time allocation
- Fri 2.11. Group 1, Group 2 (associations)
- Fri 9.11. Group 3, Group 4 (episodes)
- Fri 16.11. Group 5, Group 6 (text mining)
- Fri 23.11. Group 7, Group 8 (clustering)
- Fri 30.11. Group 9, Group 10 (KDD process)
Course Organization
Course Evaluation
- Passing the course: min. 30 points
- home exam: min. 13 points (max. 30 points)
- exercises/experiments: min. 8 points (max. 20 points)
- at least 3 returned and reported experiments
- group presentation: min. 4 points (max. 10 points)
- Remember also the other requirements
- Attending the lectures (5/7)
- Attending the seminars (4/5)
- Attending the exercises (4/5)
Course Organization
Course Material
- Lecture slides
- Original articles
- Seminar presentations
- Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8
- Remember to check the course website and folder for the material!
Summary: What is Data Mining?
- Ultimately
- "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases"
- Often just
- "Tell something interesting about this data", "Describe this data"
- Exploratory, semi-automatic data analysis on large data sets
Summary: What is Data Mining?
- Data mining: semi-automatic discovery of interesting patterns from large data sets
- Knowledge discovery is a process
- Preprocessing
- Data mining
- Postprocessing
- Different things to be mined, used, or utilized
- Databases (relational, object-oriented, spatial, WWW, ...)
- Knowledge (characterization, clustering, association, ...)
- Techniques (machine learning, statistics, visualization, ...)
- Applications (retail, telecom, Web mining, log analysis, ...)
Summary: Typical KDD Process
[Figure: the KDD process: operational database → input data → data mining → results → utilization]
Association Rules
Basics
Examples
Generation
Multi-level Rules
Constraints
Market Basket Analysis
- Analysis of customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket"
Market Basket Analysis
- Given
- A database of customer transactions (e.g., shopping baskets), where each transaction is a set of items (e.g., products)
- Find
- Groups of items which are frequently purchased together
Market Basket Analysis
- Extract information on purchasing behavior
- "IF buys beer and sausage, THEN also buys mustard with high probability"
- Actionable information can suggest...
- New store layouts and product assortments
- Which products to put on promotion
- The MBA approach is applicable whenever a customer purchases multiple things in proximity
- Credit cards
- Services of telecommunication companies
- Banking services
- Medical treatments
Market Basket Analysis
- Useful
- "On Thursdays, grocery store consumers often purchase diapers and beer together."
- Trivial
- "Customers who purchase maintenance agreements are very likely to purchase large appliances."
- Inexplicable/unexpected
- "When a new hardware store opens, one of the most sold items is toilet rings."
Association Rules: Basics
- Association rule mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Comprehensibility: simple to understand
- Utilizability: provides actionable information
- Efficiency: efficient discovery algorithms exist
- Applications
- Market basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Rules: Basics
- Typical representation formats for association rules
- diapers ⇒ beer [0.5%, 60%]
- buys(diapers) ⇒ buys(beer) [0.5%, 60%]
- "IF buys diapers, THEN buys beer in 60% of the cases. Diapers and beer are bought together in 0.5% of the rows in the database."
- Other representations (used in Han's book)
- buys(x, "diapers") ⇒ buys(x, "beer") [0.5%, 60%]
- major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Association Rules: Basics
"IF buys diapers, THEN buys beer in 60% of the cases, in 0.5% of the rows"
- Antecedent, left-hand side (LHS), body
- Consequent, right-hand side (RHS), head
- Support, frequency ("in how large a part of the data the left- and right-hand-side items occur together")
- Confidence, strength ("if the left-hand side occurs, how likely is the right-hand side to occur")
Association Rules: Basics
- Support denotes the frequency of the rule within transactions:
  support(A ⇒ B [s, c]) = p(A ∪ B) = support(A, B)
- Confidence denotes the percentage of transactions containing A which also contain B:
  confidence(A ⇒ B [s, c]) = p(B|A) = p(A ∪ B) / p(A) = support(A, B) / support(A)
  (a small computational sketch follows below)
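To make the definitions concrete, here is a minimal Python sketch (not from the original slides; the transaction data and names are illustrative):

    def support(itemset, transactions):
        """Fraction of transactions that contain every item of `itemset`."""
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """support(LHS and RHS together) / support(LHS)."""
        return support(lhs | rhs, transactions) / support(lhs, transactions)

    transactions = [{"diapers", "beer", "milk"}, {"diapers", "beer"},
                    {"milk"}, {"beer"}]
    print(support({"diapers", "beer"}, transactions))       # 0.5
    print(confidence({"diapers"}, {"beer"}, transactions))  # 1.0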
Association Rules: Basics
- Minimum support σ
- High σ → few frequent itemsets
- → few valid rules which occur very often
- Low σ → many valid rules which occur rarely
- Minimum confidence γ
- High γ → few rules, but all "almost logically true"
- Low γ → many rules, many of them very "uncertain"
- Typical values: σ = 2-10%, γ = 70-90%
Association Rules: Basics
- Transaction formats
- Relational format: <TID, item>, e.g. <1, item1>, <1, item2>, <2, item3>
- Compact format: <TID, itemset>, e.g. <1, {item1, item2}>, <2, {item3}> (a small conversion sketch follows below)
- Item vs. itemset: a single element vs. a set of items
- Support of an itemset I: the fraction of transactions containing I
- Minimum support σ: threshold for support
- Frequent itemset: an itemset with support ≥ σ
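A small sketch of converting the relational format into the compact format (illustrative data; Python dicts and sets stand in for database tables):

    from collections import defaultdict

    def relational_to_compact(rows):
        """Group <TID, item> pairs into <TID, itemset>."""
        baskets = defaultdict(set)
        for tid, item in rows:
            baskets[tid].add(item)
        return dict(baskets)

    rows = [(1, "item1"), (1, "item2"), (2, "item3")]
    print(relational_to_compact(rows))  # {1: {'item1', 'item2'}, 2: {'item3'}} (set order may vary)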
Association Rules: Basics
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find all rules with minimum support and confidence
- If min. support = 50% and min. confidence = 50%, then
- A ⇒ C [50%, 66.6%], C ⇒ A [50%, 100%]
Association Rule Generation
- Association rule mining is a two-step process
- STEP 1: Find the frequent itemsets, i.e., the sets of items that have minimum support
- The so-called Apriori trick: a subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with sizes from 1 to k (k-itemsets)
- STEP 2: Use the frequent itemsets to generate association rules
Frequent Sets with Apriori
- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k; a runnable sketch follows below):

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
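The pseudo-code maps almost line for line to a runnable, deliberately unoptimized Python sketch; this is an illustration, not the implementation used on the course:

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {frequent itemset: support}; transactions are sets of items."""
        n = len(transactions)
        # C1: all items occurring in the data
        candidates = [frozenset([i]) for i in {i for t in transactions for i in t}]
        k, result = 1, {}
        while candidates:
            # one database scan: count how many transactions contain each candidate
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            Lk = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
            result.update(Lk)
            # join Lk with itself; prune candidates with an infrequent k-subset
            candidates = list({a | b for a in Lk for b in Lk
                               if len(a | b) == k + 1
                               and all(frozenset(s) in Lk
                                       for s in combinations(a | b, k))})
            k += 1
        return result

    db = [{"beer", "chips", "bread"}, {"beer", "chips"},
          {"bread", "milk"}, {"beer", "bread"}]
    print(apriori(db, 0.5))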
Apriori Candidate Generation
- The Apriori principle
- Any subset of a frequent itemset must be frequent
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 ⋈ L3
- abcd from abc and abd
- acde from acd and ace
- Pruning
- acde is removed because ade is not in L3
- C4 = {abcd} (a sketch reproducing this example follows below)
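A minimal sketch reproducing the join and prune steps of this example, with itemsets written as sorted strings of item letters as on the slide:

    from itertools import combinations

    L3 = {"abc", "abd", "acd", "ace", "bcd"}

    # Join: pairs of L3 itemsets sharing their first k-1 = 2 items
    C4 = {"".join(sorted(set(a) | set(b)))
          for a in L3 for b in L3 if a < b and a[:2] == b[:2]}
    print(C4)  # {'abcd', 'acde'}

    # Prune: drop candidates having an infrequent 3-subset
    C4 = {c for c in C4 if all("".join(s) in L3 for s in combinations(c, 3))}
    print(C4)  # {'abcd'}: 'acde' was removed because 'ade' is not in L3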
Apriori Example (1/6)
Apriori Example (2/6)
Apriori Example (3/6)
Apriori Example (4/6)
Search Space of Database D
[Figure: the full itemset lattice over items {1, 2, 3, 4, 5}, from the five 1-itemsets up to the single 5-itemset 12345]
Apriori Example (5/6)
Apriori Trick on Level 1
[Figure: the itemset lattice over {1, 2, 3, 4, 5}, illustrating the Apriori trick at level 1]
Apriori Example (6/6)
Apriori Trick on Level 2
[Figure: the itemset lattice over {1, 2, 3, 4, 5}, illustrating the Apriori trick at level 2]
Is Apriori Fast Enough?
- The core of the Apriori algorithm
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
- Huge candidate sets
- 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
- Multiple scans of the database
- Needs (n + 1) scans, where n is the length of the longest pattern
Is Apriori Fast Enough?
- In practice
- For the basic Apriori approach, the number of attributes in a row is usually much more critical than the number of transaction rows
- For example
- 50 attributes each having 1-3 values, 100 000 rows (not very bad)
- 50 attributes each having 10-100 values, 100 000 rows (quite bad)
- 10 000 attributes each having 5-10 values, 100 rows (very bad...)
- Notice
- One attribute might have several different values
- Association rule algorithms typically treat every attribute-value pair as one attribute (2 attributes with 5 values each => "10 attributes")
- There are some ways to overcome the problem...
Improving Apriori Performance
- Hash-based itemset counting
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction
- A transaction that does not contain any frequent k-itemset is useless in subsequent scans (a sketch follows below)
- Partitioning
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling
- Mining on a subset of the given data with a lower support threshold, plus a method to determine the completeness
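As an illustration of the transaction-reduction idea alone (the other improvements are not sketched here), a minimal Python sketch with illustrative data:

    def scan_and_reduce(transactions, candidates):
        """Count candidate k-itemsets and drop transactions containing none.
        This is safe: any (k+1)-itemset inside a transaction t would imply
        that t also contains its (candidate) k-subsets."""
        counts = {c: 0 for c in candidates}
        kept = []
        for t in transactions:
            hit = False
            for c in candidates:
                if c <= t:
                    counts[c] += 1
                    hit = True
            if hit:
                kept.append(t)
        return counts, kept

    db = [{"a", "b", "c"}, {"a", "b"}, {"d"}]
    counts, db = scan_and_reduce(db, [frozenset({"a", "b"})])
    print(counts, db)  # {'d'} is dropped from all later scans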
Association Rules from Itemsets
- Pseudo-code (a runnable sketch follows below):

    for every frequent itemset l
        generate all nonempty subsets s of l
        for every nonempty subset s of l
            output the rule "s ⇒ (l - s)" if support(l)/support(s) ≥ min_conf,
            where min_conf is the minimum confidence threshold

- E.g., frequent set l = {a, b, c}, subsets s ∈ {a, b, c, ab, ac, bc}
- a ⇒ b, a ⇒ c, b ⇒ c
- a ⇒ bc, b ⇒ ac, c ⇒ ab
- ab ⇒ c, ac ⇒ b, bc ⇒ a
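A small Python sketch of this rule-generation step (the support table is made up for illustration):

    from itertools import combinations

    def rules_from_itemset(l, support, min_conf):
        """Output rules s => (l - s) for every nonempty proper subset s of l."""
        for r in range(1, len(l)):
            for subset in combinations(l, r):
                s = frozenset(subset)
                conf = support[l] / support[s]   # confidence of s => (l - s)
                if conf >= min_conf:
                    print(set(s), "=>", set(l - s), round(conf, 2))

    support = {frozenset("a"): 0.6, frozenset("b"): 0.5, frozenset("c"): 0.5,
               frozenset("ab"): 0.4, frozenset("ac"): 0.3, frozenset("bc"): 0.3,
               frozenset("abc"): 0.3}
    rules_from_itemset(frozenset("abc"), support, 0.7)
    # prints ab => c (0.75), ac => b (1.0), bc => a (1.0)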
Association Rule Generation
- Rule 1 to remember
- Generating frequent sets is slow (especially itemsets of size 2)
- Generating association rules from frequent itemsets is fast
- Rule 2 to remember
- For itemset generation, the support threshold is used
- For association rules, the confidence threshold is used
- What happens in reality; how long does it take to create frequent sets and association rules?
- Let's look at small real-life examples
- Experiments were made with a Citum 4/275 Alpha server with 512 MB of main memory, Red Hat Linux release 5.0 (kernel 2.0.30)
Performance Example (1/4)
Alarms
Performance Example (2/4)
- Telecom data containing alarms
- Example alarm: 1234 EL1 PCM 940926082623 A1 ALARMTEXT..
  (fields: alarm number, alarming network element, alarm type, date and time, alarm severity class)
- Example data 1
- 43 478 alarms (26.9.94-5.10.94, 10 days)
- 2 234 different types of alarms, 23 attributes, 5 503 different values
- Example data 2
- 73 679 alarms (1.2.95-22.3.95, 7 weeks)
- 287 different types of alarms, 19 attributes, 3 411 different values
Performance Example (3/4)
Example rule: alarm_number=1234, alarm_type=PCM ⇒ alarm_severity=A1 [2%, 45%]
Performance Example (4/4)
- Example results for data 1
- Frequency threshold 0.1% (lowest possible with this data)
- Candidate sets: 109 719, time: 12.02 s
- Frequent sets: 79 311, time: 64 855.73 s
- Rules: 3 750 000, time: 860.60 s
- Example results for data 2
- Frequency threshold 0.1% (lowest possible with this data)
- Candidate sets: 43 600, time: 1.70 s
- Frequent sets: 13 321, time: 10 478.93 s
- Rules: 509 075, time: 143.35 s
Selecting the Interesting Rules?
- Usually the result set is very big, so one must select the interesting rules based on
- Objective measures
- Two popular measures: support and confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD'95)
- A rule (pattern) is interesting if
- it is unexpected (surprising to the user), and/or
- actionable (the user can do something with it)
- These issues will be discussed with the KDD process
Boolean vs. Quantitative Rules
- Boolean vs. quantitative association rules (based on the types of values handled)
- Boolean: the rule concerns associations between the presence or absence of items (e.g., "buys A" or "does not buy A")
- buys=SQLServer, buys=DMBook ⇒ buys=DBMiner [2%, 60%]
- buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
- Quantitative: the rule concerns associations between quantitative items or attributes
- age=30..39, income=42..48K ⇒ buys=PC [1%, 75%]
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
Quantitative Rules
- Quantitative attributes: e.g., age, income, height, weight
- Categorical attributes: e.g., color of car
- Problem: too many distinct values for quantitative attributes
- Solution: transform quantitative attributes into categorical ones via discretization (a small sketch follows below) → more about this in the seminar!
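A minimal discretization sketch; the bin edges and labels are illustrative assumptions:

    def discretize(value, edges, labels):
        """Map a numeric value to a categorical interval label."""
        for hi, label in zip(edges, labels):
            if value < hi:
                return label
        return labels[-1]

    age_labels = ["<20", "20..29", "30..39", "40+"]
    print(discretize(34, [20, 30, 40], age_labels))  # '30..39'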
Single- vs. Multi-dimensional Rules
- Single-dimensional vs. multi-dimensional associations
- Single-dimensional: items or attributes in the rule refer to only one dimension (e.g., to "buys")
- Beer, Chips ⇒ Bread [0.4%, 52%]
- buys(x, "Beer") ∧ buys(x, "Chips") ⇒ buys(x, "Bread") [0.4%, 52%]
- Multi-dimensional: items or attributes in the rule refer to two or more dimensions (e.g., "buys", "time_of_transaction", "customer_category")
- In the following example: nationality, age, income
Multi-dimensional Rules
RULES:
- nationality = French ⇒ income = high [50%, 100%]
- income = high ⇒ nationality = French [50%, 75%]
- age = 50 ⇒ nationality = Italian [33%, 100%]
Single- vs. Multi-level Rules
- Single-level vs. multi-level associations
- Single-level: associations between items or attributes on the same level of abstraction (i.e., on the same level of a hierarchy)
- Beer, Chips ⇒ Bread [0.4%, 52%]
- Multi-level: associations between items or attributes on different levels of abstraction (i.e., on different levels of a hierarchy)
- Beer=Karjala, Chips=Estrella Barbeque ⇒ Bread [0.1%, 74%]
- More about multi-level association rules on the next slides
Multi-level Association Rules
- It is difficult to find interesting patterns at too primitive a level
- high support → too few rules
- low support → too many rules, most of them uninteresting
- Approach: reason at a suitable level of abstraction
- A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts
- Multi-level association rules: rules which combine associations with a hierarchy of concepts
Multi-level Association Rules
- Items often form hierarchies
- Items at a lower level are expected to have lower support
- Rules regarding itemsets at appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels
Multi-level Association Rules
[Figure: a concept hierarchy with numbered branches on each level; encoded transaction example: item code "121" = milk - 2% - Fraser]
Multi-level Association Rules
- A top-down, progressive deepening approach
- First find high-level strong rules
- milk ⇒ bread [20%, 60%]
- Then find their lower-level "weaker" rules
- 2% milk ⇒ wheat bread [6%, 50%]
- Variations of mining multi-level association rules
- Level-crossed association rules
- 2% milk ⇒ Wonder wheat bread
- Association rules with multiple, alternative hierarchies
- 2% milk ⇒ Wonder bread
Multi-level Association Rules
- Generalizing/specializing values of attributes
- ...from specialized to general: the support of rules increases (new rules may become valid)
- ...from general to specialized: the support of rules decreases (rules may become invalid when their support falls under the threshold)
- Too low a level => too many rules, and too primitive ones
- Pepsi Light 0.5 l bottle ⇒ Taffel Barbeque Chips 200 g
- Too high a level => uninteresting rules
- Food ⇒ Clothes
Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items
- Example (milk has 4 subclasses)
- milk ⇒ wheat bread [support 8%, confidence 70%]
- 2% milk ⇒ wheat bread [support 2%, confidence 72%]
- We say the first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value based on the rule's ancestor
- Above, the second rule could be redundant: with 4 subclasses of milk, the expected support is 8% / 4 = 2% (see the sketch below)
Uniform vs. Reduced Support
- Uniform support: the same minimum support for all levels
- One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
- Lower-level items do not occur as frequently. If the support threshold is
- too high → miss low-level associations
- too low → generate too many high-level associations
- Reduced support: reduced minimum support at lower levels
Uniform Support
Multi-level mining with uniform support
- Level 1 (min_sup = 5%): Milk [support = 10%]
- Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Reduced Support
Multi-level mining with reduced support
- Level 1 (min_sup = 5%): Milk [support = 10%]
- Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Progressive Deepening
- A top-down, progressive deepening approach
- First mine high-level frequent items
- milk (15%), bread (10%)
- Then mine their lower-level "weaker" frequent itemsets
- 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across multiple levels lead to different algorithms
- If adopting the same min_support across multiple levels
- then do not examine itemset t if any of t's ancestors is infrequent
- If adopting reduced min_support at lower levels
- then examine only those descendants whose ancestors' support is frequent/non-negligible (a small two-level sketch follows below)
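A two-level sketch of progressive deepening with reduced support; the hierarchy, thresholds, and data are all illustrative:

    # Two-level item hierarchy given as a parent map
    parent = {"2% milk": "milk", "skim milk": "milk",
              "wheat bread": "bread", "white bread": "bread"}
    min_sup = {1: 0.50, 2: 0.30}   # reduced threshold at level 2

    def frequent(transactions, level):
        n, counts = len(transactions), {}
        for t in transactions:
            # at level 1 count generalized items, once per transaction
            keys = {parent[i] for i in t} if level == 1 else set(t)
            for key in keys:
                counts[key] = counts.get(key, 0) + 1
        return {i for i, c in counts.items() if c / n >= min_sup[level]}

    db = [{"2% milk", "wheat bread"}, {"2% milk"},
          {"skim milk", "white bread"}, {"wheat bread"}]
    high = frequent(db, 1)
    # deepen only into items whose ancestor is frequent
    low = {i for i in frequent(db, 2) if parent[i] in high}
    print(high, low)  # {'milk', 'bread'} {'2% milk', 'wheat bread'}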
Constraint-Based Mining
- Interactive, exploratory mining of gigabytes of data?
- Could it be real? By making good use of constraints!
- What kinds of constraints can be used in mining?
- Knowledge type constraint: classification, association, etc.
- Data constraint: SQL-like queries
- Find product pairs sold together in Vancouver in Dec. '98
- Dimension/level constraints
- In relevance to region, price, brand, customer category
- Interestingness constraints
- Strong rules (min_support ≥ 3%, min_confidence ≥ 60%)
- Rule constraints (see the next slides)
Rule Constraints
- Two kinds of rule constraints
- Rule form constraints: meta-rule guided mining
- Metarule: P(x, y) ∧ Q(x, w) ⇒ takes(x, "database systems")
- Matching rule: age(x, "30..39") ∧ income(x, "41K..60K") ⇒ takes(x, "database systems")
- Rule content constraints: constraint-based query optimization (Ng, et al., SIGMOD'98)
- sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
Rule Constraints
- 1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99)
- 1-var: a constraint confining only one side (L/R) of the rule, e.g.,
- sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000 (a small checking sketch follows below)
- 2-var: a constraint confining both sides (L and R), e.g.,
- sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)
Summary
- Association rule mining
- Probably the most significant contribution from the database community to KDD
- A rather simple concept, but the "thinking" gives a basis for extensions and other methods
- A large number of papers have been published
- Many interesting issues have been explored
- Interesting research directions
- Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
References (1/5)
- R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
- R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
- R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
- S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
- K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
- D. W. Cheung, J. Han, V. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
- M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
References (2/5)
- G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
- Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
- T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
- E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.
- J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.
- J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
- M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
References (3/5)
- F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.
- B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
- H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
- R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
- R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
- R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.
- N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
References (4/5)
- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00, 11-20, Dallas, TX, May 2000.
- J. Pei and J. Han. Can we push more constraints into frequent pattern mining? KDD'00, Boston, MA, Aug. 2000.
- G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
- B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
- S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
- A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
References (5/5)
- C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.
- R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
- R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
- R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
- H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.
- D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
- K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
- M. Zaki. Generating non-redundant association rules. KDD'00, Boston, MA, Aug. 2000.
- O. R. Zaiane, J. Han, and H. Zhu. Mining recurrent items in multimedia with progressive resolution refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.
Course Organization
Next Week
- Lecture 31.10.: Episodes and recurrent patterns
- Mika gives the lecture
- Exercise 1.11.: Associations
- Pirjo takes care of you! ;-)
- Seminar 2.11.: Associations
- Pirjo gives the lecture
- 2 group presentations
Seminar Presentations
- Seminar presentations
- Articles are given on the previous week's Wed
- Presentation (around 3-5 printed pages) is due by the start of the seminar
- Can be either an HTML page or a printable document in PostScript/PDF format
- 30 minutes of presentation
- 5-15 minutes of discussion
- Active participation
Seminar Presentations
- Seminar presentations
- Try to understand the "message" of the article
- Try to present the basic ideas as clearly as possible; use examples
- Do not present detailed mathematics or algorithms
- Test: do you understand your own presentation?
- In the presentation, use PowerPoint or conventional slides
Seminar Presentations / Groups 1-2
Quantitative Rules
R. Srikant and R. Agrawal: "Mining Quantitative Association Rules in Large Relational Tables", Proc. of ACM SIGMOD 1996.
MINERULE
Rosa Meo, Giuseppe Psaila, and Stefano Ceri: "A New SQL-like Operator for Mining Association Rules", VLDB 1996, 122-133.
Introduction to Data Mining (DM)
Thank you for your attention, and have a nice weekend! Thanks to Jiawei Han from Simon Fraser University for his slides, which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.