1
ECML / PKDD 2004 Discovery Challenge
Mining Strong Associations and Exceptions in the STULONG Data Set
Eduardo Corrêa Gonçalves and Alexandre Plastino
Universidade Federal Fluminense, Department of Computer Science
Niterói, Rio de Janeiro, Brazil
{egoncalves,plastino}@ic.uff.br - http://www.ic.uff.br
Work sponsored by CNPq research grant 300879/00-8
2
Outline of the talk
  • Atherosclerosis Data Set
  • Multidimensional Association Rules
  • Exceptions
  • Data Preparation
  • Results
  • Summary

3
Atherosclerosis Data Set
  • STULONG Data Set: risk factors of atherosclerosis
    in a population of 1417 middle-aged men from the
    Czech Republic.
  • Four tables are included in this data set:
  • Entry: data related to entry examinations
    performed on these men (the first step of the
    STULONG project).
  • Control: data related to long-term observations.
  • Letter: additional information about the health
    status of 403 men.
  • Death: data related to the patients who died.

4
Basic Groups of Patients
  • The patients were classified into three basic
    groups, according to the results of the entry
    examinations:
  • Normal Group: men without the presence of any
    risk factor.
  • Risk Group: men with the presence of one or more
    risk factors.
  • Pathologic Group: men with either an identified
    cardiovascular disease or another serious disease.

5
Contribution
  • The main contribution of this work is to present
    strong association rules and exceptions mined
    from the Entry table.
  • The mining process was directed at discovering
    relations among the following characteristics of
    the patients in the basic groups:
  • Social factors.
  • Physical activities during free time.
  • Alcohol consumption.
  • Smoking.
  • Results of the biochemical examinations and the
    physical check-up.

7
Multidimensional Association Rules
  • Multidimensional Association Rules (J. Han and M.
    Kamber, 2001) represent combinations of attribute
    values that often occur together in a database.
  • They can be mined from relational databases or
    data warehouses.
  • Example:
  • (DailyBeerCons > 1l) → (Smoking > 20 cig/day)
  • Meaning: men who are heavy beer consumers also
    tend to be heavy smokers.
  • This rule involves two attributes (or
    dimensions): DailyBeerCons and Smoking.

8
Multidimensional Association Rules: Formal Definition
A1 = a1, ..., An = an → B1 = b1, ..., Bm = bm
  • Ai (1 ≤ i ≤ n) and Bj (1 ≤ j ≤ m): distinct
    attributes (dimensions) from a database relation.
  • ai and bj: values from the domains of Ai and Bj,
    respectively.
  • Generic representation: A → B
  • A is the antecedent and B is the consequent of
    the rule. Several attributes can be involved in
    both the antecedent and the consequent.

9
Interest Measures: Support and Confidence
  • Support index (Sup): the probability that a tuple
    matches all conditions in A → B.
  • Confidence index (Conf): the probability that a
    tuple matches B, given that it matches A.
  • Sup(A → B) = P(A, B) and Conf(A → B) = P(B | A).
  • The support indicates the relevance and the
    confidence indicates the validity of an
    association rule.
  • Support / Confidence Framework (Agrawal et al,
    1993): finding all rules that match user-provided
    minimum support and minimum confidence.
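
The two definitions above can be sketched in a few lines of Python. This is only an illustration, not the authors' MULTMINE code: the function names and the toy rows are assumptions made here, and the rows are invented, not taken from STULONG.

```python
def matches(row, conditions):
    """True if the row satisfies every attribute = value condition."""
    return all(row.get(attr) == value for attr, value in conditions.items())


def support(rows, antecedent, consequent):
    """Sup(A -> B) = P(A, B): fraction of rows matching both A and B."""
    both = sum(1 for r in rows
               if matches(r, antecedent) and matches(r, consequent))
    return both / len(rows)


def confidence(rows, antecedent, consequent):
    """Conf(A -> B) = P(B | A): share of rows matching A that also match B."""
    n_a = sum(1 for r in rows if matches(r, antecedent))
    if n_a == 0:
        return 0.0  # rule never applies; convention chosen for this sketch
    both = sum(1 for r in rows
               if matches(r, antecedent) and matches(r, consequent))
    return both / n_a


# Toy relation with made-up rows (hypothetical, not STULONG data).
rows = [
    {"DailyBeerCons": ">1l", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": ">1l", "Smoking": "no"},
    {"DailyBeerCons": "no", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": "no", "Smoking": "no"},
    {"DailyBeerCons": "no", "Smoking": "no"},
]
A = {"DailyBeerCons": ">1l"}
B = {"Smoking": ">20 cig/day"}

sup = support(rows, A, B)      # 1 of the 5 rows matches A and B
conf = confidence(rows, A, B)  # 1 of the 2 rows matching A also matches B
```

Representing the antecedent and the consequent as attribute/value dictionaries mirrors the "A1 = a1, ..., An = an" form of the rules.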

10
Interest Measures: Support and Confidence
  • Problems with the Support / Confidence Framework
    (Brin et al, 1997):
  • It generates a huge number of rules.
  • These rules are often obvious.
  • In many cases, these rules express relations that
    are not true.

11
Interest Measures: Support and Confidence
Id  Association Rule                               SupA    SupB    Sup     Conf
R1  (DailyBeerCons > 1l) → (Smoking > 20 cig/day)  0.1193  0.2602  0.0448  0.3758
R2  (DailyBeerCons > 1l) → (Married = yes)         0.1193  0.8487  0.0905  0.7584
  • The support and confidence values of R2 are
    higher than those of R1.
  • Is R2, in fact, more interesting than R1?

12
Negative Dependence
Id  Association Rule                        SupA    SupB    Sup     Conf
R2  (DailyBeerCons > 1l) → (Married = yes)  0.1193  0.8487  0.0905  0.7584
  • R2 should imply that men who are heavy beer
    consumers tend to be married.
  • 84.87% of men are married. However, the
    probability for a man to be married, given that
    he is a heavy beer consumer, is 75.84%.
  • Heavy beer consumers are, in fact, less likely to
    be married. There is a negative dependence
    between being married and being a heavy beer
    consumer.

13
Positive Dependence
Id  Association Rule                               SupA    SupB    Sup     Conf
R1  (DailyBeerCons > 1l) → (Smoking > 20 cig/day)  0.1193  0.2602  0.0448  0.3758
  • 26.02% of men are heavy smokers. The probability
    for a man to be a heavy smoker, given that he is
    a heavy beer consumer, is 37.58%.
  • Heavy beer consumers are more likely to smoke a
    lot.
  • There is a positive dependence between being a
    heavy beer consumer and being a heavy smoker.

14
Strong Association Rule
Id  Association Rule                               SupA    SupB    Sup     Conf
R1  (DailyBeerCons > 1l) → (Smoking > 20 cig/day)  0.1193  0.2602  0.0448  0.3758
R2  (DailyBeerCons > 1l) → (Married = yes)         0.1193  0.8487  0.0905  0.7584
  • Conclusions:
  • R1 is a strong association rule, while R2 does
    not express a true association.
  • In order to mine interesting information, we need
    to evaluate the type of dependence between the
    antecedent and the consequent of a rule.

15
Lift and RI
  • Lift: how much more frequent B is when A occurs.
  • Lift(A → B) = Conf(A → B) ÷ Sup(B)
  • RI - Rule Interest (G. Piatetsky-Shapiro, 1991):
    the fraction of tuples matched by an association
    rule beyond what would be expected if A and B
    were independent.
  • RI(A → B) = Sup(A → B) - Sup(A) × Sup(B)
  • We believe that the use of different interest
    measures (Sup, Conf, Lift and RI) provides
    alternative analyses of the same data, giving a
    better understanding of the associations.
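
As a worked check, the two formulas can be applied to the values reported for R1 and R2. The sketch below is illustrative only (function and variable names are chosen here, not taken from the authors' programs): a lift above 1 and a positive RI indicate positive dependence, while a lift below 1 and a negative RI indicate negative dependence.

```python
def lift(conf, sup_b):
    """Lift(A -> B) = Conf(A -> B) / Sup(B)."""
    return conf / sup_b


def rule_interest(sup_ab, sup_a, sup_b):
    """RI(A -> B) = Sup(A -> B) - Sup(A) * Sup(B)."""
    return sup_ab - sup_a * sup_b


# Values copied from the R1 / R2 table in the slides.
r1 = {"sup_a": 0.1193, "sup_b": 0.2602, "sup": 0.0448, "conf": 0.3758}
r2 = {"sup_a": 0.1193, "sup_b": 0.8487, "sup": 0.0905, "conf": 0.7584}

lift_r1 = lift(r1["conf"], r1["sup_b"])                      # about 1.44
lift_r2 = lift(r2["conf"], r2["sup_b"])                      # about 0.89
ri_r1 = rule_interest(r1["sup"], r1["sup_a"], r1["sup_b"])   # about +0.014
ri_r2 = rule_interest(r2["sup"], r2["sup_a"], r2["sup_b"])   # about -0.011

for name, l, ri in (("R1", lift_r1, ri_r1), ("R2", lift_r2, ri_r2)):
    kind = "positive" if l > 1 else "negative"
    print(f"{name}: lift={l:.3f} RI={ri:+.4f} -> {kind} dependence")
```

Although R2 has the higher support and confidence, both measures flag it as a negative dependence, which matches the conclusion drawn for the married / heavy beer consumer example.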

17
Exceptions
  • In our approach, exceptions represent association
    rules that become much weaker in some specific
    subsets of the database.
  • Example: does the rule (DailyBeerCons > 1l) →
    (Smoking > 20 cig/day) become weaker in any
    subset of the database?

18
Exceptions
  • This exception was obtained because the
    conventional rule (DailyBeerCons > 1l) ∧ (Age
    ≥ 50) → (Smoking > 20 cig/day) did not
    achieve its expected support.
  • This expected support is evaluated from the
    support of the original rule (DailyBeerCons
    > 1l) → (Smoking > 20 cig/day) and the
    support of the condition (Age ≥ 50).

19
Exceptions: Formal Definition
  • Let D be a database relation.
  • Let R: A → B be a multidimensional association
    rule.
  • Let Z = {Z1 = z1, ..., Zk = zk} be a set of
    conditions defined over D, where Z ∩ (A ∪ B) = ∅.
    Z is called the probe set.
  • An exception related to the positive rule R is an
    implication of the form
  • A ∧ Z ↛ B

20
Candidate Exceptions
  • Exceptions are extracted from candidate
    exceptions. A candidate exception is an
    expression of the form
  • A ∧ Z → B
  • Exceptions are mined only if the candidates do
    not achieve an expected support.
  • This expectation is evaluated based on the
    support of the original rule A → B and the
    support of the conditions that compose the probe
    set Z:
  • ExpSup(A ∧ Z → B) = Sup(A → B) × Sup(Z)

21
The Interest Measure (IM) Index
  • We developed two interest measures to evaluate
    the degree of interestingness of an exception.
  • The IM (Interest Measure) index evaluates the
    strength (relevance) of an exception.
  • IM(E) = 1 - (Sup(A ∧ Z → B) ÷ ExpSup(A ∧ Z → B))
  • An exception E is potentially interesting if the
    actual support of A ∧ Z → B is much lower than
    its expected support.
  • This measure captures the type of dependence
    between Z and A → B. The closer the value is to
    1, the stronger the negative dependence.

22
Example of the IM Index
  • R: (DailyBeerCons > 1l) → (Smoking > 20
    cig/day) - Sup(R) = 4.48%
  • Z: (Age ≥ 50) - Sup(Z) = 22.82%
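
A minimal sketch of this computation follows. Sup(R) and Sup(Z) are the values given above; the actual support of the candidate rule does not survive in this transcript, so the value used below is hypothetical and only serves to show how IM behaves. Function names are also assumptions of this sketch.

```python
def expected_support(sup_rule, sup_z):
    """ExpSup(A and Z -> B) = Sup(A -> B) * Sup(Z)."""
    return sup_rule * sup_z


def interest_measure(actual_sup, exp_sup):
    """IM(E) = 1 - Sup(A and Z -> B) / ExpSup(A and Z -> B)."""
    return 1.0 - actual_sup / exp_sup


sup_r = 0.0448   # Sup(R), from the slide
sup_z = 0.2282   # Sup(Z), from the slide

exp_sup = expected_support(sup_r, sup_z)   # about 0.0102, i.e. 1.02%

# HYPOTHETICAL actual support of the candidate rule (not from the source).
hypothetical_actual_sup = 0.0040

im = interest_measure(hypothetical_actual_sup, exp_sup)   # about 0.61
print(f"ExpSup = {exp_sup:.4f}, IM = {im:.4f}")
```

An actual support equal to the expected one gives IM = 0, and an actual support of zero gives IM = 1, the strongest possible negative dependence.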

23
Degree of Unexpectedness
  • A high value for the IM measure is not a
    guarantee that we found interesting information.

24
Degree of Unexpectedness
  • The DU (Degree of Unexpectedness) index is used
    to determine the validity of an exception.
  • This measure captures how much higher the
    negative dependence between a probe set Z and a
    rule A → B is than the negative dependence
    between Z and either A or B.
  • DU(E) = IM(E) - max(1 - Sup(A ∧ Z) ÷ ExpSup(A ∧ Z),
    1 - Sup(B ∧ Z) ÷ ExpSup(B ∧ Z))
  • The further the value is above 0, the more
    interesting the exception will be. If DU(E) ≤ 0
    the exception is uninteresting.

25
Example of the DU Index
  • R: (DailyBeerCons > 1l) → (Smoking > 20
    cig/day)
  • Sup(R) = 4.48% --- Sup(A) = 11.93% --- Sup(B) =
    26.02%
  • Z: (Age ≥ 50)
  • Sup(Z) = 22.82% --- Sup(A ∧ Z) = 2.00% ---
    Sup(B ∧ Z) = 6.00%
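
The DU formula can be sketched with these values. Two assumptions are made here and are not stated in the slides: ExpSup(A ∧ Z) is taken as Sup(A) × Sup(Z) (and likewise for B), by analogy with the expected support of a candidate exception, and the IM value is hypothetical because the actual support of the candidate is missing from this transcript.

```python
def negative_dependence(actual_sup, expected_sup):
    """1 - actual / expected: positive when the actual support falls short."""
    return 1.0 - actual_sup / expected_sup


def degree_of_unexpectedness(im, sup_a, sup_b, sup_z, sup_az, sup_bz):
    """DU(E) = IM(E) - max(dependence of Z with A, dependence of Z with B).

    Assumption of this sketch: ExpSup(X and Z) = Sup(X) * Sup(Z).
    """
    dep_a = negative_dependence(sup_az, sup_a * sup_z)
    dep_b = negative_dependence(sup_bz, sup_b * sup_z)
    return im - max(dep_a, dep_b)


# Values from the slide.
sup_a, sup_b, sup_z = 0.1193, 0.2602, 0.2282
sup_az, sup_bz = 0.0200, 0.0600

dep_a = negative_dependence(sup_az, sup_a * sup_z)   # about 0.265
dep_b = negative_dependence(sup_bz, sup_b * sup_z)   # slightly below 0

hypothetical_im = 0.60  # HYPOTHETICAL IM value (not from the source)
du = degree_of_unexpectedness(hypothetical_im, sup_a, sup_b, sup_z,
                              sup_az, sup_bz)
print(f"dep(A,Z) = {dep_a:.4f}, dep(B,Z) = {dep_b:.4f}, DU = {du:.4f}")
```

Under these assumptions, Z = (Age ≥ 50) is already negatively dependent with the antecedent alone, so an exception on Z is only interesting to the extent that its IM exceeds roughly 0.27.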

27
Data Preparation
  • The following relations in the ARFF format
    (Witten and Frank, 2000) were generated from the
    original Entry table:
  • ENTRYTOT: 1249 tuples (men from groups A, B and
    C).
  • ENTRYA: 276 tuples (only men from group A).
  • ENTRYB: 859 tuples (only men from group B).
  • ENTRYC: 114 tuples (only men from group C).

28
Data Preparation
  • Data was enriched with new fields and the
    continuous attributes were discretized.

Field                  Possible Values
Cholesterol            desirable (<200), bordering (200-239), high (≥240)
Triglycerides          desirable (<150), bordering (150-200), high (201-499), very high (≥500)
BMI (body mass index)  underweight (bmi < 20), normal (20 ≤ bmi < 25), overweight (25 ≤ bmi < 30), obese (30 ≤ bmi < 40), morbidly obese (bmi ≥ 40)
Blood Pressure         normal, normal / high, high
Skin Folds             8-20, 21-30, 31-40, >40
Age                    38-39, 40-44, 45-49, ≥50
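
The discretization in the table can be sketched as simple threshold functions. The sketch below covers only BMI and cholesterol, using the cut points from the table; the function names are chosen here for illustration, and treating a non-integer cholesterol value between 239 and 240 as "bordering" is an assumption, since the table lists integer ranges.

```python
def discretize_bmi(bmi):
    """Map a body mass index to the categories listed in the table."""
    if bmi < 20:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    if bmi < 40:
        return "obese"
    return "morbidly obese"


def discretize_cholesterol(value):
    """Map a cholesterol reading to desirable / bordering / high."""
    if value < 200:
        return "desirable"
    if value < 240:  # assumption: values just below 240 count as bordering
        return "bordering"
    return "high"


# Hypothetical readings, for illustration only.
for bmi, chol in [(19.5, 180), (24.9, 200), (27.0, 239), (41.2, 240)]:
    print(bmi, discretize_bmi(bmi), chol, discretize_cholesterol(chol))
```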

30
Results
  • We developed two programs in C++ (g++ compiler):
  • MULTMINE: used to mine strong multidimensional
    association rules.
  • EXCEPMINE: used to mine exceptions.
  • We used the following thresholds in the
    experiments:
  • Minimum support = 1% (MULTMINE).
  • Minimum IM = 0.30 and minimum DU = 0.05
    (EXCEPMINE).
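
How such thresholds act as filters can be sketched as follows. This is not the MULTMINE or EXCEPMINE code, which is not shown in the slides: the function names are hypothetical, and whether the original programs treat the thresholds as inclusive is an assumption made here.

```python
MIN_SUPPORT = 0.01   # minimum support, 1%
MIN_IM = 0.30        # minimum IM
MIN_DU = 0.05        # minimum DU


def passes_rule_threshold(sup):
    """Keep a rule only if its support reaches the minimum support."""
    return sup >= MIN_SUPPORT  # inclusive comparison is an assumption


def passes_exception_thresholds(im, du):
    """Keep a candidate only if both IM and DU reach their minimums."""
    return im >= MIN_IM and du >= MIN_DU


# Rule R1 from the slides has support 0.0448.
print(passes_rule_threshold(0.0448))                 # True
# A candidate with high IM but low DU is rejected (hypothetical values).
print(passes_exception_thresholds(0.60, 0.01))       # False
```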

31
Group A - EntryALL
  • (Group = A) → (Education = university)

SupA    SupB    Sup     Conf    Lift   RI
0.2210  0.2762  0.0873  0.3949  1.430  0.0262
  • Group A is the only group where a university
    degree is the most common education level
    (Conf = 0.3949).
  • (Group = A) → (PhysActAfterJob = great
    activity)

SupA    SupB    Sup     Conf    Lift   RI
0.2210  0.0857  0.0320  0.1449  1.692  0.0131
  • There is a strong positive dependence between
    belonging to Group A and practicing physical
    activities intensely in free time (Lift = 1.692).

32
Alcohol Consumption x Smoking
  • (DailyBeerCons > 1l) → (SmokingDuration > 20
    years)

Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.0688  0.1667  0.0145  0.2105  1.263  0.0030
B      0.1362  0.5751  0.0908  0.6667  1.159  0.0125
C      0.1140  0.4737  0.0789  0.6923  1.461  0.0249
  • Drinking a lot and smoking for more than 20 years
    are positively dependent in groups A, B, and C
    (Lift and RI columns).
  • However, there are far fewer long-term smokers in
    Group A (SupB column). In groups B and C, most of
    the heavy beer consumers have smoked cigarettes
    for more than 20 years (Conf column).
  • Men from group B tend to smoke and drink more
    (SupA, SupB and Sup columns).

33
Alcohol Consumption x Cholesterol
  • (Alcohol = no) → (Cholesterol = desirable)

Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.0870  0.3370  0.0507  0.5833  1.731  0.0214
B      0.0861  0.1828  0.0186  0.2162  1.183  0.0029
C      0.1316  0.1316  0.0263  0.2000  1.520  0.0090
  • Not drinking alcohol and having cholesterol in
    the desirable range are positively dependent
    in groups A, B, and C (Lift and RI columns).
  • There are fewer alcohol consumers in Group C
    (SupA column).
  • In group A, most of the men who do not drink
    alcohol have cholesterol in the desirable range
    (Conf column).

34
Education x Smoking
  • (Education = university) → (Smoking = no)

Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.3949  0.5109  0.2210  0.5596  1.095  0.0193
B      0.2526  0.1793  0.0664  0.2627  1.465  0.0211
C      0.1667  0.2018  0.0877  0.5263  2.608  0.0541
  • Men with the highest education level are less
    likely to be smokers (Lift and RI columns).
  • In groups A and C, the majority of men with a
    university degree do not smoke (Conf column). The
    support of this rule is very high in group A.
  • In group B, most men with a university degree are
    smokers (Conf column). However, not smoking and
    having a university degree are still strongly
    positively dependent (Lift and RI columns).

35
Skin Folds x Body Mass Index
  • (Skin Folds ≤ 20) → (BMI = normal)

Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.2319  0.5326  0.1558  0.6719  1.261  0.0323
B      0.2154  0.3586  0.1478  0.6865  1.914  0.0706
C      0.1140  0.2632  0.0789  0.6923  2.631  0.0489
  • Most of the men classified into the lowest range
    of the attribute Skin Folds have a body mass
    index in the normal range (Conf column).
  • Both attributes are strongly positively dependent
    (Lift and RI columns).
  • There are far fewer men with a normal BMI in
    Group C (SupB column).

36
Exceptions
  • (Education = apprentice school) ∧
    (PhysActAfterJob = great act.) ↛ (Smoking =
    15-20 cig/day)
  • IM = 0.4755, DU = 0.2069
  • Original rule: men whose education level is
    apprentice school tend to smoke a lot.
  • Exception: among the men who practice physical
    activities intensely in free time, the support
    of the original rule is 47.55% smaller than
    expected.
  • The degree of unexpectedness is 20.69%.

37
Exceptions
  • (Education = university) ∧ (Group = C) ↛
    (BMI = normal)
  • IM = 0.7018, DU = 0.3052
  • Original rule: men with the highest education
    level tend to have a body mass index in the
    normal range.
  • Exception: among the men who belong to Group C,
    the support of the original rule is 70.18%
    smaller than expected.
  • The degree of unexpectedness is 30.52%.

39
Summary
  • We presented some strong association rules and
    exceptions mined from the STULONG Data Set,
    concerning the entry examinations.
  • Strong association rules revealed differences in
    the correlations among the characteristics of
    the patients from the three basic groups.
  • Exceptions indicated negative patterns associated
    with previously known strong positive rules.
    These exceptions were mined from candidates that
    did not achieve an expected support value.

40
Future Work
  • Apply the same approach to the relations Letter,
    Control and Death.
  • Besides mining rules with large deviation between
    the actual and the expected support, we intend to
    investigate the interestingness of rules with
    large deviation between the actual and the
    expected confidence value.

41
Universidade Federal Fluminense - http://www.uff.br
Niterói, Rio de Janeiro, Brazil
Thank you!