Data Mining

Transcript and Presenter's Notes

1
Data Mining
2
Jim
Jim's cows
Which cows should I buy??
3
Jim's cows
Cows on sale
Which cows should I buy??
4
Which cows should I buy??
  • And suppose I know their behavior, preferred
    mating months, milk production, nutritional
    habits, and immune system data?
  • Now suppose I have 10,000 cows

5
Understanding data
  • Trying to find patterns in data is not new:
    hunters seek patterns in animal migration,
    politicians in voting habits, people in their
    partners' behavior, etc.
  • However, the amount of available data is
    increasing very fast (exponentially?).
  • This gives greater opportunities to extract
    valuable information from the data.
  • But it also makes the task of understanding the
    data with conventional tools very difficult.

6
Data Mining
  • Data Mining: the process of discovering patterns
    in data, usually stored in a Database. The
    patterns lead to advantages (economic or other).
  • A very fast growing area of research, because of:
  • Huge databases (Walmart: about 20 million
    transactions/day)
  • Automatic data capture of transactions (bar codes,
    satellites, scanners, cameras, etc.)
  • Large financial advantage
  • Evolving analytical methods

7
Data Mining techniques in some HUJI courses
8
Data Mining
  • Two extremes for expressing the patterns:
  • Black Box: Buy cows Zehava, Petra and Paulina
  • Transparent Box (Structural Patterns): Buy
    cows with age < 300 or cows with calm
    behavior and > 90 liters of milk production per
    month

9
The weather example
Today is Overcast, mild temperature, high
humidity, and windy. Will we play?
10
Questions one can ask
  • A set of rules learned from this data could be
    presented in a Decision List:
  • If outlook = sunny and humidity = high then play = no
  • ElseIf outlook = rainy and windy = true then play = no
  • ElseIf outlook = overcast then play = yes
  • ElseIf humidity = normal then play = yes
  • Else play = yes
  • This is an example of Classification Rules (a
    short sketch in code follows below)
  • We could also look for Association Rules:
  • If temperature = cool then humidity = normal
  • If windy = false and play = no then outlook = sunny
    and humidity = high
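
A minimal Python sketch of the decision list above (not part of the original slides); the attribute names and values follow the weather example, and the function name is our own.

  # Sketch of the decision list above (weather example).
  # Rules are tried in order; the first matching rule decides "play".
  # temperature is passed for completeness but is not used by any rule.
  def classify(outlook, temperature, humidity, windy):
      if outlook == "sunny" and humidity == "high":
          return "no"
      elif outlook == "rainy" and windy:
          return "no"
      elif outlook == "overcast":
          return "yes"
      elif humidity == "normal":
          return "yes"
      else:
          return "yes"

  # Slide 9's question: overcast, mild temperature, high humidity, windy.
  print(classify("overcast", "mild", "high", True))  # -> yes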

11
Example Cont.
  • The previous example is very simplified. Real
    Databases will probably
  • Contain Numerical values as well.
  • Contain Noise and errors.
  • Be a lot larger.
  • And the analysis we are asked to perform might
    not be of Association Rules, but rather Decision
    Trees, Neural Networks, etc.

12
Caution
  • David Rhine was a parapsychologist in the
    1930s-1950s.
  • He hypothesized that some people have
    Extra-Sensory Perception (ESP).
  • He asked people to say whether 10 hidden cards are
    red or blue.
  • He discovered that almost 1 in every 1000 people
    has ESP!
  • He told these people that they have ESP and
    called them in for another test.
  • He discovered almost all of them had lost their
    ESP!
  • He concluded that:
  • You shouldn't tell people they have ESP; it
    causes them to lose it.
  • Source: J. Ullman

13
Another Example
  • A classic example is a Database which holds data
    concerning purchases in a supermarket.
  • Each Shopping Basket is a list of items that were
    bought in a single purchase by some customer.
  • Such huge DBs which are saved for long periods
    of time are called Data Warehouses.
  • It is extremely valuable for the manager of the
    store to extract Association Rules from the huge
    Data Warehouse.
  • It is even more valuable if this information can
    be associated with the person buying; hence the
    value of Club Memberships.

14
Supermarket Example
  • For example, if Beer and Diapers are found to be
    bought together often, this might encourage the
    manager to give a discount for purchasing Beer,
    Diapers and a new product together.
  • Another example: if older people are found to be
    more loyal to a certain brand than young
    people, a manager might not promote a new brand
    of shampoo intended for older people.

15
The Purchases Relation
Itemset: a set of items. Support of an itemset:
the fraction of transactions that contain all
items in the itemset.
  • What is the Support of the following itemsets?
    (A sketch of the computation follows below.)
  • {pen}
  • {pen, ink}
  • {pen, juice}
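
The Purchases table itself is not reproduced in this transcript, so the following Python sketch uses made-up baskets (hypothetical data, not the relation from the slide) just to show how support is computed.

  # Computing the support of an itemset over a list of baskets (sets of items).
  transactions = [
      {"pen", "ink", "milk"},
      {"pen", "ink", "juice"},
      {"pen", "milk"},
      {"pen", "ink", "milk", "water"},
  ]

  def support(itemset, transactions):
      # Fraction of transactions that contain all items in the itemset.
      hits = sum(1 for t in transactions if itemset <= t)
      return hits / len(transactions)

  print(support({"pen"}, transactions))           # 1.0
  print(support({"pen", "ink"}, transactions))    # 0.75
  print(support({"pen", "juice"}, transactions))  # 0.25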

16
Frequent Itemsets
  • We would like to find items that are purchased
    together in high frequency - Frequent Itemsets.
  • We look for itemsets which have a
    support ≥ minSupport.
  • If minSupport is set to 0.7, then the frequent
    itemsets in our example would be
  • {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}
  • The A-Priori property of frequent itemsets: every
    subset of a frequent itemset is also a frequent
    itemset.

17
Algorithm for finding Frequent itemsets
  • Suppose we have n items.
  • The naïve approach: for every subset of items,
    check if it is frequent.
  • Very expensive
  • Improvement (based on the A-priori property):
    first identify frequent itemsets of size 1, then
    try to expand them.
  • Greatly reduces the number of candidate frequent
    itemsets.
  • A single scan of the table is enough to determine
    which candidate itemsets are frequent.
  • The algorithm terminates when no new frequent
    itemsets are found in an iteration.

18
Algorithm for finding Frequent itemsets
  • foreach item, check if it is a frequent itemset
    (appears in ≥ minSupport of the transactions)
  • k = 1
  • repeat
  • foreach new frequent itemset Ik with k items:
  • Generate all itemsets Ik+1 with k+1 items, such
    that Ik is contained in Ik+1.
  • scan all transactions once and add itemsets that
    have support ≥ minSupport.
  • k = k + 1
  • until no new frequent itemsets are found (a
    Python sketch of this loop follows below)
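
Below is a minimal Python sketch of the levelwise loop above, assuming transactions are given as a list of item sets; it also applies the A-Priori pruning described two slides later. Function and variable names are our own.

  from itertools import combinations

  def apriori(transactions, min_support):
      # Levelwise search for frequent itemsets, as in the pseudocode above.
      n = len(transactions)

      def support(itemset):
          return sum(1 for t in transactions if itemset <= t) / n

      # k = 1: keep every single item that is frequent on its own.
      items = {item for t in transactions for item in t}
      frequent = {frozenset([i]) for i in items
                  if support(frozenset([i])) >= min_support}
      all_frequent = set(frequent)

      k = 1
      while frequent:
          # Generate candidates with k+1 items by joining frequent k-itemsets.
          candidates = {a | b for a in frequent for b in frequent
                        if len(a | b) == k + 1}
          # A-Priori pruning: drop candidates with an infrequent k-subset.
          candidates = {c for c in candidates
                        if all(frozenset(s) in frequent
                               for s in combinations(c, k))}
          # One scan of the transactions decides which candidates are frequent.
          frequent = {c for c in candidates if support(c) >= min_support}
          all_frequent |= frequent
          k += 1
      return all_frequent

With the made-up baskets from the earlier sketch and min_support = 0.7, this happens to return {pen}, {ink}, {milk}, {pen, ink} and {pen, milk}, the same outcome as the worked example on the next slide.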

19
Finding Frequent itemsets, on table Purchases,
with minSupport = 0.7
  • In the first run, the following single itemsets
    are found to be frequent: {pen}, {ink}, {milk}.
  • Now we generate the candidates for k = 2: {pen,
    ink}, {pen, milk}, {pen, juice}, {ink, milk},
    {ink, juice} and {milk, juice}.
  • By scanning the relation, we determine that the
    following are frequent: {pen, ink}, {pen, milk}.
  • Now we generate the candidates for k = 3: {pen,
    ink, milk}, {pen, milk, juice}, {pen, ink,
    juice}.
  • By scanning the relation, we determine that none
    of these are frequent, and the algorithm ends
    with {pen}, {ink}, {milk}, {pen,
    ink}, {pen, milk}

20
Algorithm refinement
  • One important refinement: after the
    candidate-generation phase, and before the scan
    of the relation (A-priori), eliminate candidate
    itemsets in which there is a subset which is not
    frequent. This is due to the A-Priori property.
  • In the second iteration, this means we would
    eliminate {pen, juice}, {ink, juice} and {milk,
    juice} as candidates, since juice is not
    frequent. So we only check {pen, ink},
    {pen, milk} and {ink, milk}.
  • So only {pen, ink, milk} is generated as a
    candidate, but it is eliminated before the scan
    because {ink, milk} is not frequent.
  • So we don't perform the 3rd scan of the
    relation.
  • More complex algorithms use the same tools:
    iterative generation and testing of candidate
    itemsets.

21
Association Rules
  • Up until now we discussed identification of
    frequent itemsets. We now wish to go one step
    further.
  • An association rule is of the structure
    {pen} ⇒ {ink}
  • It should be read as: if a pen is purchased in a
    transaction, it is likely that ink will also be
    purchased in that transaction.
  • It describes the data in the DB (past).
    Extrapolation to future transactions should be
    done with caution.
  • More formally, an Association Rule is LHS ⇒ RHS,
    where both LHS and RHS are sets of items, and it
    implies that if every item in LHS was purchased
    in a transaction, it is likely that the items in
    RHS are purchased as well.

22
Measures for Association Rules
  • Support of LHS ⇒ RHS is the support of the
    itemset (LHS ∪ RHS). In other words, the fraction
    of transactions that contain all items in (LHS ∪
    RHS).
  • Confidence of LHS ⇒ RHS: consider all
    transactions which contain all items in LHS. The
    fraction of these transactions that also contain
    all items in RHS is the confidence of the rule.
  • Confidence = S(LHS ∪ RHS) / S(LHS)
  • The confidence of a rule is an indication of the
    strength of the rule.
  • What is the support of {pen} ⇒ {ink}? And the
    confidence?
  • What is the support of {ink} ⇒ {pen}? And the
    confidence? (A sketch follows below.)
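
A small sketch of the two measures (again with the made-up baskets, so the numbers are only illustrative):

  # Support and confidence of a rule LHS => RHS.
  def support(itemset, transactions):
      return sum(1 for t in transactions if itemset <= t) / len(transactions)

  def confidence(lhs, rhs, transactions):
      # Confidence = S(LHS U RHS) / S(LHS)
      return support(lhs | rhs, transactions) / support(lhs, transactions)

  transactions = [{"pen", "ink", "milk"}, {"pen", "ink", "juice"},
                  {"pen", "milk"}, {"pen", "ink", "milk", "water"}]
  print(support({"pen", "ink"}, transactions))       # support of {pen} => {ink}: 0.75
  print(confidence({"pen"}, {"ink"}, transactions))  # 0.75
  print(confidence({"ink"}, {"pen"}, transactions))  # 1.0

Note that both directions of a rule have the same support but, in general, different confidence.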

23
Finding Association rules
  • A user can ask for rules with minimum support
    minSup and minimum confidence minConf.
  • Firstly, all frequent itemsets with
    support ≥ minSup are computed with the previous
    Algorithm.
  • Secondly, rules are generated using the frequent
    itemsets, and checked for minConf.

24
Finding Association rules
  • Find all frequent itemsets using the previous
    algorithm.
  • For each frequent itemset X with support S(X):
  • For each division of X into 2 itemsets:
  • Divide X into 2 itemsets LHS and RHS.
  • The Confidence of LHS ⇒ RHS is S(X)/S(LHS).
  • We computed S(LHS) in the previous algorithm
    (because LHS is frequent since X is frequent;
    a sketch of this step follows below).
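
A possible Python sketch of this rule-generation step, assuming the frequent-itemset algorithm has stored the support of every frequent itemset in a dict keyed by frozenset (the names here are our own):

  from itertools import combinations

  def generate_rules(frequent_supports, min_conf):
      # frequent_supports: {frozenset(itemset): support}, from the previous algorithm.
      rules = []
      for x, s_x in frequent_supports.items():
          if len(x) < 2:
              continue
          # Every non-empty proper subset of x can serve as the LHS.
          for r in range(1, len(x)):
              for lhs in map(frozenset, combinations(x, r)):
                  rhs = x - lhs
                  conf = s_x / frequent_supports[lhs]  # S(X) / S(LHS), already computed
                  if conf >= min_conf:
                      rules.append((lhs, rhs, s_x, conf))
      return rules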

25
Generalized association rules
We would like to know if the rule {pen} ⇒ {juice}
is different on the first day of the month
compared to other days. How? What are its support
and confidence generally? And on the first days
of the month?
26
Generalized association rules
  • By specifying different attributes to group by
    (the date, in the last example), we can come up
    with interesting rules which we would otherwise
    miss.
  • Another example would be to group by location and
    check if the same rules apply for customers from
    Jerusalem compared to Tel Aviv.
  • By comparing the support and confidence of the
    rules we can observe differences in the data
    under different conditions (a sketch follows
    below).
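
One way to sketch this in Python, assuming each transaction carries its attributes (for example a date) alongside the basket; the helper names and the grouping key are hypothetical:

  # Compare a rule across groups of transactions.
  # Each transaction is a pair (attributes_dict, basket_set).
  def support(itemset, baskets):
      return sum(1 for b in baskets if itemset <= b) / len(baskets)

  def confidence(lhs, rhs, baskets):
      return support(lhs | rhs, baskets) / support(lhs, baskets)

  def rule_by_group(transactions, lhs, rhs, key):
      # key maps a transaction's attributes to a group label.
      groups = {}
      for attrs, basket in transactions:
          groups.setdefault(key(attrs), []).append(basket)
      return {label: (support(lhs | rhs, baskets), confidence(lhs, rhs, baskets))
              for label, baskets in groups.items()}

  # Group by "is this the first day of the month?":
  # rule_by_group(purchases, {"pen"}, {"juice"},
  #               key=lambda attrs: attrs["date"].day == 1)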

27
Caution in prediction
  • When we find a pattern in the data, we wish to
    use it for prediction (that is, in many cases, the
    point).
  • However, we have to be cautious about this.
  • For example, suppose {pen} ⇒ {ink} has a high
    support and confidence. We might give a discount
    on pens in order to increase sales of pens and
    therefore also sales of ink.
  • However, this assumes a causal link between pen
    and ink.

28
Caution in prediction
  • Suppose pens and pencils are always sold together.
  • We would then also get the rule {pencil} ⇒ {ink},
    with the same support and confidence as
    {pen} ⇒ {ink}.
  • However, it is clear there is no causal link
    between buying pencils and buying ink.
  • If we promoted pencils it would not cause an
    increase in sales of ink, despite high support
    and confidence.
  • The chance to infer wrong rules (rules which
    are not causal links) decreases as the DB size
    increases, but we should keep in mind that such
    rules do come up.
  • Therefore, the generated rules are only a good
    starting point for identifying causal links.

29
Classification and Regression rules
  • Consider the following relation:
  • InsuranceInfo(age integer, carType string,
    highRisk bool)
  • The relation holds information about current
    customers.
  • The company wants to use the data in order to
    predict if a new customer, whose age and carType
    are known, is at high risk (and therefore charge
    a higher insurance fee, of course).
  • Such a rule, for example, could be: if age is
    between 18 and 23, and carType is either sports
    or truck, the risk is high (a sketch of this rule
    as code follows below).
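
The example rule is simple enough to write down directly; a hypothetical Python sketch (the function name and return convention are our own):

  # The example classification rule over InsuranceInfo.
  def high_risk(age, car_type):
      # Predict the dependent attribute highRisk from the predictor attributes.
      return 18 <= age <= 23 and car_type in ("sports", "truck")

  print(high_risk(20, "sports"))  # True  -> charge a higher fee
  print(high_risk(40, "sedan"))   # False

In practice such a rule would be learned from the existing customers' data rather than written by hand.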

30
Classification and Regression rules
  • Such rules, where we are only interested in
    predicting one attribute are special.
  • The attribute which we predict is called the
    Dependent attribute.
  • The other attributes are called the Predictor
    attributes.
  • If the dependent attribute is categorical, we
    call such rules classification rules.
  • If the dependent attribute is numerical, we call
    such rules regression rules.

31
Regression in a nutshell
Jim's cows (training set)
new cow (test set)
32
Regression in a nutshell
  • Assume that the Rate is a linear combination of
    the other attributes:
  • Rate = w0 + w1*BP + w2*MA + w3*AGE + w4*NOC
  • Our goal is thus to find w0, w1, w2, w3, w4
    (which actually means how strongly each attribute
    affects the Rate).
  • We thus want to minimize
  • Σi ( Rate(i) - (w0 + w1*BP(i) + w2*MA(i) + w3*AGE(i)
    + w4*NOC(i)) )²
    where i is the cow number, Rate(i) is the real
    Rate of cow i, and the inner term is the
    prediction of the Rate using w0-w4.
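
A minimal least-squares sketch of this fit, assuming the training set is available as numeric arrays (the numbers below are made up; the column order follows BP, MA, AGE, NOC):

  import numpy as np

  # Hypothetical training data: one row per cow, columns BP, MA, AGE, NOC.
  X = np.array([[7.0, 6.0, 4.0, 2.0],
                [5.0, 8.0, 6.0, 1.0],
                [9.0, 3.0, 3.0, 3.0],
                [6.0, 5.0, 7.0, 2.0],
                [8.0, 7.0, 2.0, 4.0]])
  rate = np.array([8.0, 6.0, 9.0, 5.0, 9.0])    # the dependent attribute

  # Add a column of ones so that w0 (the intercept) is learned as well.
  A = np.hstack([np.ones((len(X), 1)), X])

  # Least squares: minimize sum_i (Rate(i) - (w0 + w1*BP(i) + ... + w4*NOC(i)))^2
  w, *_ = np.linalg.lstsq(A, rate, rcond=None)
  print(w)                                       # w0, w1, w2, w3, w4

  new_cow = np.array([1.0, 6.0, 5.0, 4.0, 2.0])  # leading 1 for the intercept
  print(new_cow @ w)                             # predicted Rate for the new cow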
33
Regression in a nutshell
  • This minimization is pretty straightforward
    (though outside the scope of this course).
  • It will give better coefficients the larger the
    training set is.
  • Of course, the rate is not deterministic.
  • The assumption that the sum is linear is wrong in
    many cases. Hence the use of SVM, Neural
    Networks, etc.
  • Notice this only deals with the case of all
    attributes being numerical.