Title: Query Flocks
1 Query Flocks
- Umar Hammoud
- Elizabeth Cash
- March 25, 2003
2 Presentation Based On
- Paper Title
- Query Flocks: A Generalization of
Association-Rule Mining
- Authors: Dick Tsur, Jeffrey D. Ullman,
Serge Abiteboul, Chris Clifton, Rajeev
Motwani, Svetlozar Nestorov, Arnon Rosenthal
3 Association Rules
- The goal is to find sets of items that are
associated
- The fact of their association is called an
association rule
4 Market Basket Mining
- Understand the behavior of customers when
they shop, in order to improve marketing
- An attempt by a retail store to learn what items
its customers purchase together
- A way to find items that tend to appear together
in a market basket
5 Precise Measures of Association
- Given a relation
- baskets(BID, Item), where BID is the basket
ID
1. Support: the items must appear in many
baskets.
2. Confidence: the probability of one item given
that the others are in the basket must be high.
3. Interest: that probability must be
significantly higher or lower than the expected
probability if items were purchased at random.
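The three measures above can be made concrete on a toy baskets(BID, Item) relation. The following is a minimal sketch, not the paper's definitions: the data is invented, and interest is formalized here as confidence minus the overall fraction of baskets containing the item, which is one common choice and an assumption on my part.

```python
from fractions import Fraction

# Hypothetical toy data: baskets(BID, Item) as a set of (bid, item) tuples.
baskets = {
    (1, "milk"), (1, "cereal"),
    (2, "milk"), (2, "cereal"), (2, "bread"),
    (3, "milk"),
    (4, "bread"),
    (5, "cereal"), (5, "milk"),
}

ALL_BIDS = {bid for bid, _ in baskets}

def support(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for b in ALL_BIDS if all((b, i) in baskets for i in items))

def confidence(lhs, rhs):
    """Fraction of the baskets containing `lhs` that also contain `rhs`."""
    return Fraction(support(lhs | {rhs}), support(lhs))

def interest(lhs, rhs):
    """Assumed formalization: confidence minus the base rate of `rhs`."""
    return confidence(lhs, rhs) - Fraction(support({rhs}), len(ALL_BIDS))

print(support({"cereal", "milk"}))      # baskets holding both items
print(confidence({"cereal"}, "milk"))   # P(milk | cereal)
print(interest({"cereal"}, "milk"))     # lift over the base rate of milk
```

Exact fractions are used so that the values can be compared without floating-point noise.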
6 Examples: Measures of Association
- People who buy milk often buy cereal. Item set:
{cereal, milk}
1. High support means that many people buy both
cereal and milk
2. High confidence means that a lot of people
who buy cereal also buy milk.
3. High interest means that if you buy cereal,
then you are much more likely to buy milk than
the general population.
7 Association-Rule Optimization
- Association-rule mining can be optimized by
borrowing ideas from query optimization (e.g., the
a-priori technique)
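8 The A-Priori Optimization
Let S be a set of items that appears in at least n
baskets, and let S' be a subset of S. Then S' also
appears in at least n baskets.
- Conversely, an item that appears in fewer than n
baskets cannot belong to any set that appears in
at least n baskets
- Using this technique, tuples can be eliminated
before the join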
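The pruning that this property allows can be sketched for pairs of items. This is an illustrative sketch on invented data, with hypothetical function names; it compares a version that first discards infrequent single items against a naive version that tests every pair.

```python
from itertools import combinations

# Hypothetical toy data: basket ID -> set of items.
baskets = {
    1: {"milk", "cereal", "bread"},
    2: {"milk", "cereal"},
    3: {"milk", "bread"},
    4: {"cereal"},
    5: {"milk", "cereal", "beer"},
}

def support(items):
    """Count the baskets that contain every item in `items`."""
    return sum(1 for contents in baskets.values() if set(items) <= contents)

def frequent_pairs(n):
    """Pairs with support >= n, after pruning infrequent single items."""
    all_items = set().union(*baskets.values())
    frequent_items = sorted(i for i in all_items if support([i]) >= n)
    # Only pairs of frequent items can themselves be frequent.
    return [p for p in combinations(frequent_items, 2) if support(p) >= n]

def frequent_pairs_naive(n):
    """Same result, but every pair of items is tested."""
    all_items = sorted(set().union(*baskets.values()))
    return [p for p in combinations(all_items, 2) if support(p) >= n]

print(frequent_pairs(3))
print(frequent_pairs_naive(3))
```

Both functions return the same pairs; the pruned version simply considers fewer candidates, which is where the savings come from on large data.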
9 A-Priori Generalization
- Extended to provide efficient mining of very
large databases, for many different kinds of
patterns.
- Can be used for
- general-purpose mining systems
- future generations of query optimizers
10 Query Flocks
- A parameterized query with a filter condition to
eliminate the uninteresting values of the
parameters
11 Mining Languages
- Can SQL be used as a mining language?
- In principle it can, but the right optimization is
not there.
12 SQL: What's the Problem?
- The A-Priori trick has not been implemented by
any conventional optimizer
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item
AND i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)
- Better performance can be achieved if the query
is rewritten in the following way:
- First, find those items that appear in at least
20 baskets
- Then join the set of these items with the baskets
relation
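The rewrite described above can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the data is invented, the pair query is written with the comparison operators one would expect for this problem (Item < Item, equal BIDs, a COUNT threshold of 20), and the helper relation name `ok` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (BID INTEGER, Item TEXT)")

# Hypothetical data: 25 baskets with milk and cereal, 3 with milk and caviar.
rows = []
for bid in range(25):
    rows += [(bid, "milk"), (bid, "cereal")]
for bid in range(25, 28):
    rows += [(bid, "milk"), (bid, "caviar")]
conn.executemany("INSERT INTO baskets VALUES (?, ?)", rows)

NAIVE = """
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)
"""

# Step 1: items in at least 20 baskets. Step 2: join only those items.
REWRITTEN = """
WITH ok(Item) AS (
    SELECT Item FROM baskets GROUP BY Item HAVING 20 <= COUNT(BID)
)
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
  AND i1.Item IN (SELECT Item FROM ok)
  AND i2.Item IN (SELECT Item FROM ok)
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)
"""

naive = conn.execute(NAIVE).fetchall()
rewritten = conn.execute(REWRITTEN).fetchall()
print(naive, rewritten)
```

Both queries return the same pairs. In the rewritten form the rare item never reaches the self-join, which is the effect the slide asks the optimizer to achieve on its own.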
13 Mining with Flocks
- Many data-mining problems can benefit from
a-priori-style optimization
- The formalism of query flocks is an important
tool for building better optimizers
14 Query Flocks
- A family of queries, identical except for the
values of their parameters, that are asked
simultaneously
- The answers to these queries are filtered
- The parameter values of the queries whose answers
pass the filter become part of the result
15 Query Flock Settings
- Queries are parameterized by one or more
parameters
- Ability to express filter conditions about the
results of the query
16 Query Flock Designation
- One or more predicates that represent data stored
as relations
- A set of parameters, with names starting with $
- A query
- A filter that specifies a condition
17 Language for Flocks
- Conjunctive queries augmented with arithmetic
and with union
- Datalog is used rather than SQL because it gives
the following capabilities:
- The notion of a safe query for Datalog figures
into potential optimizations
- The set of options for adapting the A-Priori
trick to arbitrary flocks is most easily
expressed in Datalog
- SQL is used for the filter language only
18 Market Basket as a Query Flock
QUERY
answer(B) :- baskets(B,$1) AND baskets(B,$2)
FILTER
COUNT(answer.B) >= 20
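One way to read the flock above is as a brute-force procedure: try every assignment of the two parameters, evaluate the query, and keep the assignments whose answer passes the filter. The sketch below does exactly that on invented data, with a toy threshold of 2 in place of 20; the function names are hypothetical.

```python
from itertools import product

# Hypothetical toy data: baskets(BID, Item).
baskets = {
    (1, "milk"), (1, "cereal"),
    (2, "milk"), (2, "cereal"),
    (3, "milk"), (3, "bread"),
}

def answer(p1, p2):
    """The query with both parameters fixed: baskets holding p1 and p2."""
    return {b for (b, item) in baskets if item == p1 and (b, p2) in baskets}

def evaluate_flock(threshold):
    """Parameter pairs whose answer has at least `threshold` baskets."""
    items = sorted({item for _, item in baskets})
    return [(p1, p2) for p1, p2 in product(items, repeat=2)
            if len(answer(p1, p2)) >= threshold]

print(evaluate_flock(2))
```

The result contains each qualifying pair in both orders, plus pairs of an item with itself. That redundancy is what the lexicographic-order restriction on a later slide removes.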
19 Language Extensions
- To apply the proposed query optimizations,
extensions must be added:
- Negated subgoals
- Arithmetic subgoals for variables and parameters
20 Extensions Usage
- Add an arithmetic subgoal to the previous query to
restrict item pairs to appear in lexicographic
order
answer(B) :- baskets(B,$1) AND baskets(B,$2)
AND $1 < $2
21 Extensions Usage
- Given the following relations
- diagnoses(Patient, Disease)
- exhibits(Patient, Symptom)
- treatments(Patient, Medicine)
- causes(Disease, Symptom)
- Find unexplained side effects
QUERY
answer(P) :- exhibits(P,$s)
AND treatments(P,$m) AND diagnoses(P,D)
AND NOT causes(D,$s)
FILTER
COUNT(answer.P) >= 20
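The side-effect flock can be evaluated directly on small relations to see how the negated subgoal behaves. This is a sketch with invented data and a toy threshold of 2 in place of 20; the disease, symptom, and medicine names are placeholders.

```python
from itertools import product

# Hypothetical toy relations.
diagnoses = {(1, "flu"), (2, "flu"), (3, "flu")}
exhibits = {(1, "fever"), (1, "rash"), (2, "fever"), (2, "rash"), (3, "fever")}
treatments = {(1, "drugA"), (2, "drugA"), (3, "drugB")}
causes = {("flu", "fever")}

def answer(s, m):
    """Patients with symptom s, on medicine m, whose disease does not cause s."""
    return {p for (p, sym) in exhibits if sym == s
            if (p, m) in treatments
            for (q, d) in diagnoses if q == p and (d, s) not in causes}

def evaluate_flock(threshold):
    """All (symptom, medicine) pairs whose answer passes the filter."""
    symptoms = sorted({s for _, s in exhibits})
    medicines = sorted({m for _, m in treatments})
    return [(s, m) for s, m in product(symptoms, medicines)
            if len(answer(s, m)) >= threshold]

print(evaluate_flock(2))
```

Fever is explained by the diagnosed disease, so it never counts. Rash is not explained, and it co-occurs with drugA in enough patients to pass the filter.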
22 Generalizing A-Priori Techniques
- Evaluate the less expensive query first
- Its answer gives an upper bound on the size of
the answer obtained with certain parameter values
- If the bound is less than the filter threshold,
eliminate those parameter values
without further consideration
- For query Q1 to put an upper bound on the size
of the result of query Q2, it must be provable
that the result of Q2 is a subset of the result
of Q1
- The containment-mapping theorem says
- Q2 ⊆ Q1 can hold if Q1 is constructed from Q2
by
- Taking a subset of the subgoals of Q2, and
- Splitting zero or more variables into several
variables.
23 Safe Query Example
answer(B) :- baskets(B,$1) AND baskets(B,$2)
AND $1 < $2
Two safe subqueries are formed by taking two proper
subsets of the subgoals:
answer(B) :- baskets(B,$1)
and
answer(B) :- baskets(B,$2)
24 Safe Query Example (cont.)
If we take the first, we can ask:
- For what values of $1 does the query
answer(B) :- baskets(B,$1)
produce a number of values of B that meets the
threshold given in the filter?
- Any other value of $1 can be eliminated as a
member of a pair of items meeting the filter
condition
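The safe subqueries in this example can be found mechanically. The sketch below assumes the usual Datalog safety rule, with parameters treated like variables: every term in the head or in a negated or arithmetic subgoal must also appear in some positive relational subgoal. That treatment of parameters, and the tuple encoding of subgoals, are my assumptions.

```python
from itertools import combinations

HEAD_VARS = {"B"}

# Each subgoal: (text, is_positive_relational, terms it mentions).
SUBGOALS = [
    ("baskets(B,$1)", True, {"B", "$1"}),
    ("baskets(B,$2)", True, {"B", "$2"}),
    ("$1 < $2", False, {"$1", "$2"}),
]

def is_safe(subgoals):
    """Check that head and non-positive terms are bound by positive subgoals."""
    bound = set()
    needed = set(HEAD_VARS)
    for _, positive, terms in subgoals:
        if positive:
            bound |= terms
        else:
            needed |= terms
    return needed <= bound

def safe_subqueries():
    """List every non-empty subset of the subgoals that forms a safe query."""
    result = []
    for size in range(1, len(SUBGOALS) + 1):
        for subset in combinations(SUBGOALS, size):
            if is_safe(subset):
                result.append([text for text, _, _ in subset])
    return result

for body in safe_subqueries():
    print("answer(B) :- " + " AND ".join(body))
```

Under this rule the two single-subgoal queries from the slide are safe, as is the pair of baskets subgoals without the comparison. Any subset that keeps the comparison but drops a baskets subgoal is rejected, because one of the parameters would be unbound.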
25 Search for Optimal Query-Flock Evaluators
R(P) := FILTER(P,Q,C)
P: set of parameters
Q: query involving the parameters P
R: relation whose tuples are values of the parameters P
C: condition on the result of the query Q
26 A Query Plan
okS($s) := FILTER($s,
    answer(P) :- exhibits(P,$s),
    COUNT(answer.P) >= 20)
okM($m) := FILTER($m,
    answer(P) :- treatments(P,$m),
    COUNT(answer.P) >= 20)
ok($s,$m) := FILTER($s,$m,
    answer(P) :-
        okS($s) AND
        okM($m) AND
        diagnoses(P,D) AND
        exhibits(P,$s) AND
        treatments(P,$m) AND
        NOT causes(D,$s),
    COUNT(answer.P) >= 20)
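The three-step plan can be simulated with one generic filter function applied three times. This is a sketch only: the relations are invented, the threshold is 2 instead of 20, and `filter_step` is a hypothetical helper that fixes the condition to a COUNT threshold.

```python
from itertools import product

# Hypothetical toy relations.
diagnoses = {(1, "flu"), (2, "flu"), (3, "flu")}
exhibits = {(1, "fever"), (1, "rash"), (2, "fever"), (2, "rash"),
            (3, "fever"), (3, "cough")}
treatments = {(1, "drugA"), (2, "drugA"), (3, "drugB")}
causes = {("flu", "fever"), ("flu", "cough")}
THRESHOLD = 2  # the slides use 20

def filter_step(candidates, query, threshold=THRESHOLD):
    """Keep the parameter values whose answer has >= threshold tuples."""
    return {c for c in candidates if len(query(c)) >= threshold}

def full_query(sm):
    """The original flock query for one (symptom, medicine) pair."""
    s, m = sm
    return {p for (p, sym) in exhibits if sym == s
            if (p, m) in treatments
            for (q, d) in diagnoses if q == p and (d, s) not in causes}

symptoms = {s for _, s in exhibits}
medicines = {m for _, m in treatments}

# okS and okM: cheap one-subgoal queries that prune single parameters.
ok_s = filter_step(symptoms, lambda s: {p for (p, sym) in exhibits if sym == s})
ok_m = filter_step(medicines, lambda m: {p for (p, med) in treatments if med == m})

# ok: the full query, restricted to the surviving parameter values.
ok = filter_step(set(product(ok_s, ok_m)), full_query)

# Reference result without any pruning.
direct = filter_step(set(product(symptoms, medicines)), full_query)
print(ok, direct)
```

The pruned plan and the direct evaluation agree, but the final step examines only the pairs built from surviving symptoms and medicines rather than every combination.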
27 Is There a Rule for Generating the Query
Plans?
- Consider only sequences of filter steps that
satisfy these conditions:
- Each step must use the same filter condition as
the original query flock
- Each step must define a uniquely named relation
- Each step is derived from the given query flock
as follows:
- Start with the original query flock
- Add zero or more subgoals that are copies of
the left side of the assignment (:=) in some
previous filter step
- Delete zero or more subgoals but, following the
optimization principle for conjunctive queries,
make sure that the resulting query is safe
- The final step must not delete any subgoals of
the original query; it may have additional
subgoals derived from previous steps, of course.
28 Exponential Search: Query Plan
- Candidate for the best possible plan:
- A long sequence of steps in which each uses the
results of the previous step
- How to restrict the search:
- Select sets of parameters
- Select a list of subsets of the subgoals of the
original query that form safe queries
29 Dynamic Selection of Filter Steps
- We let the sizes of intermediate relations
determine whether or not to apply filters
- The important special case:
- When the set of parameters for a relation has not
previously been encountered
- If the support threshold is low, then filtering
is likely to be useful
- If the support threshold is high, then filtering
is unlikely to be useful
30 Possible Query Plan
temp1($s) := FILTER($s,
    answer(P) :- exhibits(P,$s),
    COUNT(answer.P) >= 20
)
temp2(P,$s,$m) := (temp1($s) JOIN
    exhibits(P,$s)) JOIN treatments(P,$m)
temp3($s,$m) := FILTER($s,$m,
    answer(P) :- temp2(P,$s,$m),
    COUNT(answer.P) >= 20
)
temp4(P,D,$s,$m) := ((temp3($s,$m) JOIN
    temp2(P,$s,$m))
    JOIN diagnoses(P,D)) JOIN
    (NOT causes(D,$s))
sideEffect($s,$m) := FILTER($s,$m,
    answer(P) :- temp4(P,D,$s,$m),
    COUNT(answer.P) >= 20
)
31 Conclusions
- It's a generate-and-test model for data-mining
problems
- Uses "parts of queries" constructively to prune
answer sets for main queries
- Provides a parameterized way to specify a set of
queries, whose answer is the parameter(s)
32 So What Should Tim Tell His Mother?
- In one sentence:
- Generalization of query optimization techniques
to be used for data mining.
- And Questions?