Title: Data Mining
1. Data Mining
2. Jim
Jim's cows
Which cows should I buy?
3. Jim's cows
Cows on sale
Which cows should I buy?
4. Which cows should I buy?
- And suppose I know their behavior, preferred mating months, milk production, nutritional habits, immune system data?
- Now suppose I have 10,000 cows.
5. Understanding data
- Trying to find patterns in data is not new: hunters seek patterns in animal migration, politicians in voting habits, people in their partners' behavior, etc.
- However, the amount of available data is increasing very fast (exponentially?).
- This gives greater opportunities to extract valuable information from the data.
- But it also makes the task of understanding the data with conventional tools very difficult.
6. Data Mining
- Data Mining: the process of discovering patterns in data, usually stored in a database. The patterns lead to advantages (economic or other).
- Very fast-growing area of research, because of:
  - Huge databases (Walmart: ~20 million transactions/day)
  - Automatic data capture of transactions (bar codes, satellites, scanners, cameras, etc.)
  - Large financial advantage
  - Evolving analytical methods
7. Data Mining techniques in some HUJI courses
8. Data Mining
- Two extremes for the expression of the patterns:
  - Black Box: "Buy cows Zehava, Petra and Paulina"
  - Transparent Box (Structural Patterns): "Buy cows with age < 300, or cows with calm behavior and > 90 liters of milk production per month"
9. The weather example
Today is Overcast, mild temperature, high
humidity, and windy. Will we play?
10. Questions one can ask
- A set of rules learned from this data could be presented in a Decision List:
  - If outlook = sunny and humidity = high then play = no
  - ElseIf outlook = rainy and windy = true then play = no
  - ElseIf outlook = overcast then play = yes
  - ElseIf humidity = normal then play = yes
  - Else play = yes
- This is an example of Classification Rules.
- We could also look for Association Rules:
  - If temperature = cool then humidity = normal
  - If windy = false and play = no then outlook = sunny and humidity = high
11. Example (cont.)
- The previous example is very simplified. Real databases will probably:
  - Contain numerical values as well.
  - Contain noise and errors.
  - Be a lot larger.
- And the analysis we are asked to perform might not be of Association Rules, but rather Decision Trees, Neural Networks, etc.
12. Caution
- David Rhine was a parapsychologist in the 1930s-1950s.
- He hypothesized that some people have Extra-Sensory Perception (ESP).
- He asked people to say whether 10 hidden cards are red or blue.
- He discovered that almost 1 in every 1000 people has ESP! (Guessing 10 binary choices correctly by pure chance has probability 1/1024, i.e. roughly 1 in 1000.)
- He told these people that they have ESP and called them in for another test.
- He discovered almost all of them had lost their ESP!
- He concluded that you shouldn't tell people they have ESP, as it causes them to lose it.
- Source: J. Ullman
13. Another Example
- A classic example is a database which holds data concerning purchases in a supermarket.
- Each Shopping Basket is a list of items that were bought in a single purchase by some customer.
- Such huge DBs, which are saved for long periods of time, are called Data Warehouses.
- It is extremely valuable for the manager of the store to extract Association Rules from the huge Data Warehouse.
- It is even more valuable if this information can be associated with the person buying, hence the Club Memberships.
14. Supermarket Example
- For example, if Beer and Diapers are found to be bought together often, this might encourage the manager to give a discount for purchasing Beer, Diapers and a new product together.
- Another example: if older people are found to be more loyal to a certain brand than young people, a manager might not promote a new brand of shampoo intended for older people.
15. The Purchases Relation
- Itemset: a set of items.
- Support of an itemset: the fraction of transactions that contain all items in the itemset.
- What is the Support of:
  - {pen}?
  - {pen, ink}?
  - {pen, juice}?
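To make the definition concrete, here is a small Python sketch of the support computation. The Purchases table itself is not reproduced in this text, so the baskets below are invented for illustration:

# Invented baskets, for illustration only; the actual Purchases relation is not shown here.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk", "juice"},
    {"pen", "ink", "milk", "water"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"pen"}, transactions))           # 1.0
print(support({"pen", "ink"}, transactions))    # 0.75
print(support({"pen", "juice"}, transactions))  # 0.25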
16. Frequent Itemsets
- We would like to find items that are purchased together with high frequency: Frequent Itemsets.
- We look for itemsets which have support ≥ minSupport.
- If minSupport is set to 0.7, then the frequent itemsets in our example would be {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}.
- The A-Priori property of frequent itemsets: every subset of a frequent itemset is also a frequent itemset.
17. Algorithm for finding Frequent Itemsets
- Suppose we have n items.
- The naïve approach: for every subset of items, check if it is frequent. Very expensive.
- Improvement (based on the A-Priori property): first identify frequent itemsets of size 1, then try to expand them.
- This greatly reduces the number of candidate frequent itemsets.
- A single scan of the table is enough to determine which candidate itemsets are frequent.
- The algorithm terminates when no new frequent itemsets are found in an iteration.
18. Algorithm for finding Frequent Itemsets
- foreach item, check if it is a frequent itemset (appears in ≥ minSupport of the transactions)
- k = 1
- repeat
  - foreach new frequent itemset Ik with k items:
    - generate all itemsets Ik+1 with k+1 items, such that Ik is contained in Ik+1
  - scan all transactions once and add itemsets that have support ≥ minSupport
  - k++
- until no new frequent itemsets are found
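A rough Python sketch of this level-wise procedure, assuming transactions are given as a list of item sets; it follows the pseudocode directly rather than an optimized Apriori implementation:

def find_frequent_itemsets(transactions, min_support):
    # transactions: list of sets of items, e.g. [{"pen", "ink"}, ...]
    # min_support:  required fraction of transactions, e.g. 0.7
    # Returns a dict mapping frequent itemsets (frozensets) to their support.
    n = len(transactions)
    items = {item for t in transactions for item in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # k = 1: every single item that appears in >= min_support of the transactions
    new_frequent = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            new_frequent[frozenset([item])] = s
    frequent = dict(new_frequent)

    # repeat until no new frequent itemsets are found
    while new_frequent:
        # extend each new frequent itemset Ik by one extra item to get candidates Ik+1
        candidates = {itemset | {extra}
                      for itemset in new_frequent
                      for extra in items if extra not in itemset}
        new_frequent = {}
        for c in candidates:
            if c in frequent:
                continue
            s = support(c)   # one pass over the transactions per candidate
            if s >= min_support:
                new_frequent[c] = s
        frequent.update(new_frequent)
    return frequent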
19. Finding Frequent Itemsets on table Purchases, with minSupport = 0.7
- In the first run, the following single itemsets are found to be frequent: {pen}, {ink}, {milk}.
- Now we generate the candidates for k = 2: {pen, ink}, {pen, milk}, {pen, juice}, {ink, milk}, {ink, juice} and {milk, juice}.
- By scanning the relation, we determine that the following are frequent: {pen, ink}, {pen, milk}.
- Now we generate the candidates for k = 3: {pen, ink, milk}, {pen, milk, juice}, {pen, ink, juice}.
- By scanning the relation, we determine that none of these are frequent, and the algorithm ends with {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}.
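To tie this walk-through to the sketch above, here is how it could be run. The baskets are again invented, since the real Purchases table is not reproduced here, but they happen to yield the same final itemsets:

# Invented baskets, for illustration only
purchases = [
    {"pen", "ink", "milk"},
    {"pen", "ink", "water"},
    {"pen", "ink", "milk", "juice"},
    {"pen", "milk"},
]

result = find_frequent_itemsets(purchases, min_support=0.7)
for itemset, sup in sorted(result.items(), key=lambda kv: len(kv[0])):
    print(set(itemset), sup)
# Prints the five frequent itemsets: {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}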
20. Algorithm refinement
- One important refinement: after the candidate-generation phase, and before the scan of the relation, eliminate candidate itemsets in which there is a subset which is not frequent. This is justified by the A-Priori property.
- In the second iteration, this means we would eliminate {pen, juice}, {ink, juice} and {milk, juice} as candidates, since {juice} is not frequent. So we only check {pen, ink}, {pen, milk} and {ink, milk}.
- So only {pen, ink, milk} is generated as a candidate in the third iteration, but it is eliminated before the scan because {ink, milk} is not frequent.
- So we do not perform the 3rd scan of the relation.
- More complex algorithms use the same tools: iterative generation and testing of candidate itemsets.
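A possible sketch of that pruning step, to be applied to the candidate set before the scan; the helper name prune_candidates is ours, not from the slides:

from itertools import combinations

def prune_candidates(candidates, frequent):
    # Keep only candidates whose every (k-1)-subset is already known to be frequent.
    # candidates: iterable of frozensets of size k
    # frequent:   set or dict of previously found frequent itemsets (frozensets)
    pruned = []
    for c in candidates:
        subsets = (frozenset(s) for s in combinations(c, len(c) - 1))
        if all(s in frequent for s in subsets):
            pruned.append(c)
    return pruned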
21. Association Rules
- Up until now we discussed identification of frequent itemsets. We now wish to go one step further.
- An association rule is of the structure pen ⇒ ink.
- It should be read as: "if a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction."
- It describes the data in the DB (the past). Extrapolation to future transactions should be done with caution.
- More formally, an Association Rule is LHS ⇒ RHS, where both LHS and RHS are sets of items, and it implies that if every item in LHS was purchased in a transaction, it is likely that the items in RHS are purchased as well.
22. Measures for Association Rules
- Support of LHS ⇒ RHS is the support of the itemset (LHS ∪ RHS); in other words, the fraction of transactions that contain all items in (LHS ∪ RHS).
- Confidence of LHS ⇒ RHS: consider all transactions which contain all items in LHS. The fraction of these transactions that also contain all items in RHS is the confidence of the rule: S(LHS ∪ RHS) / S(LHS).
- The confidence of a rule is an indication of the strength of the rule.
- What is the support of pen ⇒ ink? And the confidence?
- What is the support of ink ⇒ pen? And the confidence?
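A small sketch of these two measures in Python, reusing the support helper and the invented transactions from the earlier example:

def rule_support(lhs, rhs, transactions):
    # Support of LHS => RHS: the support of the union of both itemsets
    return support(set(lhs) | set(rhs), transactions)

def rule_confidence(lhs, rhs, transactions):
    # Confidence of LHS => RHS: S(LHS u RHS) / S(LHS)
    return support(set(lhs) | set(rhs), transactions) / support(set(lhs), transactions)

print(rule_support({"pen"}, {"ink"}, transactions))
print(rule_confidence({"pen"}, {"ink"}, transactions))
print(rule_confidence({"ink"}, {"pen"}, transactions))  # generally different from the previous line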
23. Finding Association Rules
- A user can ask for rules with minimum support minSup and minimum confidence minConf.
- First, all frequent itemsets with support ≥ minSup are computed with the previous algorithm.
- Second, rules are generated using the frequent itemsets and checked against minConf.
24. Finding Association Rules
- Find all frequent itemsets using the previous algorithm.
- For each frequent itemset X with support S(X):
  - For each division of X into 2 itemsets LHS and RHS:
    - The confidence of LHS ⇒ RHS is S(X) / S(LHS).
    - We already computed S(LHS) in the previous algorithm (because LHS is frequent, since X is frequent).
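A hedged sketch of this rule-generation step, building on the output of find_frequent_itemsets above; the function and variable names are ours:

from itertools import combinations

def generate_rules(frequent, min_conf):
    # frequent: dict mapping frozensets to their support, as returned by find_frequent_itemsets
    rules = []
    for X, support_X in frequent.items():
        if len(X) < 2:
            continue
        # every non-empty proper subset of X can serve as the LHS
        for r in range(1, len(X)):
            for lhs in map(frozenset, combinations(X, r)):
                rhs = X - lhs
                confidence = support_X / frequent[lhs]  # S(X) / S(LHS), both already computed
                if confidence >= min_conf:
                    rules.append((set(lhs), set(rhs), support_X, confidence))
    return rules

# Example: rules with confidence >= 0.8 from the illustrative purchases above
for lhs, rhs, sup, conf in generate_rules(result, min_conf=0.8):
    print(lhs, "=>", rhs, "support:", sup, "confidence:", round(conf, 2))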
25. Generalized association rules
We would like to know if the rule pen ⇒ juice is different on the first day of the month compared to other days. How? What are its support and confidence in general? And on the first days of the month?
26. Generalized association rules
- By specifying different attributes to group by (date in the last example), we can come up with interesting rules which we would otherwise miss.
- Another example would be to group by location and check if the same rules apply for customers from Jerusalem compared to Tel Aviv.
- By comparing the support and confidence of the rules under different conditions, we can observe differences in the data.
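One possible way to sketch this in Python, assuming each basket carries a day-of-month attribute; the data layout and grouping key are our assumptions, not from the slides:

from collections import defaultdict

# Hypothetical transactions with a date attached: (day_of_month, set_of_items)
dated_purchases = [
    (1, {"pen", "juice"}),
    (1, {"pen", "ink"}),
    (15, {"pen", "milk"}),
    (22, {"pen", "juice", "milk"}),
]

# Group the baskets by the condition of interest, then measure the rule per group
groups = defaultdict(list)
for day, basket in dated_purchases:
    groups["first day" if day == 1 else "other days"].append(basket)

for label, baskets in groups.items():
    sup = rule_support({"pen"}, {"juice"}, baskets)
    conf = rule_confidence({"pen"}, {"juice"}, baskets)
    print(label, "support:", sup, "confidence:", round(conf, 2))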
27. Caution in prediction
- When we find a pattern in the data, we wish to use it for prediction (that is, in many cases, the point).
- However, we have to be cautious about this.
- For example, suppose pen ⇒ ink has high support and confidence. We might give a discount on pens in order to increase sales of pens, and therefore also sales of ink.
- However, this assumes a causal link between pens and ink.
28. Caution in prediction
- Suppose pens and pencils are always sold together.
- We would then also get the rule pencil ⇒ ink, with the same support and confidence as pen ⇒ ink.
- However, it is clear there is no causal link between buying pencils and buying ink.
- If we promoted pencils, it would not cause an increase in sales of ink, despite the high support and confidence.
- The chance of inferring "wrong" rules (rules which are not causal links) decreases as the DB size increases, but we should keep in mind that such rules do come up.
- Therefore, the generated rules are only a good starting point for identifying causal links.
29. Classification and Regression rules
- Consider the following relation:
- InsuranceInfo(age: integer, carType: string, highRisk: bool)
- The relation holds information about current customers.
- The company wants to use the data in order to predict if a new customer, whose age and carType are known, is at high risk (and therefore charge a higher insurance fee, of course).
- Such a rule, for example, could be: if age is between 18 and 23, and carType is either sports or truck, the risk is high.
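Expressed as code, the example rule might look like the following sketch; the thresholds are taken from the slide, but the function name is ours:

def high_risk(age: int, car_type: str) -> bool:
    # Classification rule from the example: young drivers of sports cars or trucks are high risk
    return 18 <= age <= 23 and car_type in ("sports", "truck")

print(high_risk(20, "sports"))  # True
print(high_risk(40, "sedan"))   # False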
30. Classification and Regression rules
- Such rules, where we are only interested in predicting one attribute, are special.
- The attribute which we predict is called the dependent attribute.
- The other attributes are called the predictor attributes.
- If the dependent attribute is categorical, we call such rules classification rules.
- If the dependent attribute is numerical, we call such rules regression rules.
31. Regression in a nutshell
Jim's cows (training set)
New cow (test set)
32. Regression in a nutshell
- Assume that the Rate is a linear combination of the other attributes:
  Rate = w0 + w1·BP + w2·MA + w3·AGE + w4·NOC
- Our goal is thus to find w0, w1, w2, w3, w4 (which actually indicate how strongly each attribute affects the Rate).
- We thus want to minimize the squared gap between the predicted Rate (using w0-w4) and the real Rate, summed over the cows i:
  Σi ( Rate(i) - (w0 + w1·BP(i) + w2·MA(i) + w3·AGE(i) + w4·NOC(i)) )²
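A minimal sketch of this least-squares fit using numpy; the training rows are invented, and np.linalg.lstsq performs the minimization described above:

import numpy as np

# Invented training data: one row per cow with columns BP, MA, AGE, NOC, plus the observed Rate
X = np.array([
    [120.0, 4.0, 3.0, 1.0],
    [130.0, 5.0, 4.0, 2.0],
    [110.0, 6.0, 2.0, 1.0],
    [125.0, 5.0, 5.0, 3.0],
    [118.0, 4.5, 3.5, 2.0],
    [122.0, 5.5, 4.0, 2.0],
])
rate = np.array([80.0, 95.0, 70.0, 100.0, 85.0, 88.0])

# Add a constant column so the intercept w0 is fitted together with w1..w4
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Least squares: find w minimizing sum_i (Rate(i) - X1(i) . w)^2
w, residuals, rank, _ = np.linalg.lstsq(X1, rate, rcond=None)
print("w0..w4:", w)

# Predict the Rate for a new cow (the "test set" from the previous slide)
new_cow = np.array([1.0, 122.0, 5.0, 4.0, 2.0])
print("predicted rate:", new_cow @ w)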
33. Regression in a nutshell
- This minimization is pretty straightforward (though outside the scope of this course).
- It will give better coefficients the larger the training set is.
- Of course, the Rate is not deterministic.
- The assumption that the sum is linear is wrong in many cases; hence the use of SVMs, Neural Networks, etc.
- Notice this only deals with the case of all attributes being numerical.