Classification - PowerPoint PPT Presentation

Slides: 23
Provided by: Gregor273

Transcript and Presenter's Notes

Title: Classification


1
  • Classification

2
Classification
  • Task: given a set of pre-classified examples,
    build a model or classifier to classify new
    cases.
  • Supervised learning: classes are known for the
    examples used to build the classifier.
  • A classifier can be a set of rules, a decision
    tree, a neural network, etc.
  • Typical applications: credit approval, direct
    marketing, fraud detection, medical diagnosis,
    ...

3
Simplicity first
  • Simple algorithms often work very well!
  • There are many kinds of simple structure, e.g.:
  • One attribute does all the work
  • All attributes contribute equally and independently
  • A weighted linear combination might do
  • Instance-based: use a few prototypes
  • Use simple logical rules
  • Success of method depends on the domain

4
Inferring rudimentary rules
  • 1R learns a 1-level decision tree
  • I.e., rules that all test one particular
    attribute
  • Basic version:
  • One branch for each value
  • Each branch assigns most frequent class
  • Error rate: proportion of instances that don't
    belong to the majority class of their
    corresponding branch
  • Choose attribute with lowest error rate
  • (assumes nominal attributes)

5
Pseudo-code for 1R
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
  • Note: "missing" is treated as a separate
    attribute value

6
Evaluating the weather attributes
Attribute  Rules            Errors  Total errors
Outlook    Sunny -> No      2/5     4/14
           Overcast -> Yes  0/4
           Rainy -> Yes     2/5
Temp       Hot -> No*       2/4     5/14
           Mild -> Yes      2/6
           Cool -> Yes      1/4
Humidity   High -> No       3/7     4/14
           Normal -> Yes    1/7
Windy      False -> Yes     2/8     5/14
           True -> No*      3/6

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
  • * indicates a tie
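
The evaluation above can be reproduced with a short Python sketch of 1R for nominal attributes. The names `WEATHER`, `ATTRIBUTES`, and `one_r` are illustrative choices, not part of the original slides. Ties between classes are broken in favour of the class seen first, which is only one possible convention.

```python
from collections import Counter, defaultdict

# Weather data from the slide: Outlook, Temp, Humidity, Windy, Play (class).
WEATHER = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]
ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Windy"]


def one_r(rows, attributes):
    """Return (best attribute, {attribute: (rules, total errors)})."""
    results = {}
    for idx, name in enumerate(attributes):
        # Count how often each class appears for each attribute value.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[idx]][row[-1]] += 1
        # Each value predicts its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors: instances outside the majority class of their branch.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        results[name] = (rules, errors)
    # The first attribute with the lowest error count wins.
    best = min(results, key=lambda name: results[name][1])
    return best, results


best, results = one_r(WEATHER, ATTRIBUTES)
for name, (rules, errors) in results.items():
    print(name, rules, f"{errors}/{len(WEATHER)}")
print("Chosen attribute:", best)
```

Outlook and Humidity both make 4 errors out of 14. This sketch keeps the first of them, Outlook.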

7
Dealing with numeric attributes
  • Discretize numeric attributes
  • Divide each attribute's range into intervals
  • Sort instances according to the attribute's values
  • Place breakpoints where the class (the majority
    class) changes
  • This minimizes the total error
  • Example: temperature from weather data

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

8
The problem of overfitting
  • This procedure is very sensitive to noise
  • One instance with an incorrect class label will
    probably produce a separate interval
  • Also: a time stamp attribute will have zero errors
  • Simple solution: enforce a minimum number of
    instances in the majority class per interval

9
Discretization example
  • Example (with min = 3):

64   65   68   69   70  | 71   72   72   75   75  | 80   81   83   85
Yes  No   Yes  Yes  Yes | No   No   Yes  Yes  Yes | No   Yes  Yes  No

  • Final result for temperature attribute (adjacent
    intervals with the same majority class merged):

64   65   68   69   70   71   72   72   75   75  | 80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes | No   Yes  Yes  No

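
The discretization procedure from the preceding slides (sort, grow an interval until its majority class has the minimum number of instances, then merge neighbours that predict the same class) can be sketched as below. The function names are hypothetical. The tie-breaking rule, where the class seen first wins, is an assumption; a different rule could change which intervals are merged.

```python
from collections import Counter

TEMPERATURE = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
PLAY = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
        "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]


def majority(interval):
    """Most frequent class in an interval (first seen wins a tie)."""
    return Counter(label for _, label in interval).most_common(1)[0]


def partition(values, labels, min_majority=3):
    """Split sorted instances into intervals with enough majority instances."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # stable sort
    intervals, current = [], []
    for i, (value, label) in enumerate(pairs):
        current.append((value, label))
        if i + 1 == len(pairs):
            break
        next_value, next_label = pairs[i + 1]
        top_class, top_count = majority(current)
        # Close the interval once the majority class is large enough,
        # the next instance has another class, and its value differs.
        if (top_count >= min_majority and next_label != top_class
                and next_value != value):
            intervals.append(current)
            current = []
    if current:
        intervals.append(current)
    return intervals


def merge_same_majority(intervals):
    """Merge adjacent intervals that predict the same class."""
    merged = []
    for interval in intervals:
        if merged and majority(merged[-1])[0] == majority(interval)[0]:
            merged[-1] = merged[-1] + interval
        else:
            merged.append(list(interval))
    return merged


def breakpoints(intervals):
    """Midpoints between the last and first values of adjacent intervals."""
    return [(a[-1][0] + b[0][0]) / 2 for a, b in zip(intervals, intervals[1:])]


initial = partition(TEMPERATURE, PLAY, min_majority=3)
final = merge_same_majority(initial)
print([len(i) for i in initial], breakpoints(final))
```

On the temperature data this gives three initial intervals and a single final breakpoint at 77.5, in line with the temperature rule used in the resulting rule set.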
10
With overfitting avoidance
  • Resulting rule set

Attribute    Rules                       Errors  Total errors
Outlook      Sunny -> No                 2/5     4/14
             Overcast -> Yes             0/4
             Rainy -> Yes                2/5
Temperature  <= 77.5 -> Yes              3/10    5/14
             > 77.5 -> No                2/4
Humidity     <= 82.5 -> Yes              1/7     3/14
             > 82.5 and <= 95.5 -> No    2/6
             > 95.5 -> Yes               0/1
Windy        False -> Yes                2/8     5/14
             True -> No                  3/6

11
  • Missing Values

12
Missing Values
  • Many data sets are plagued by the problem of
    missing values
  • missing values can be a result of manual data
    entry, incorrect measurements, equipment errors,
    etc.
  • they are usually denoted by special markers
    such as:
  • NULL
  • ?

13
Table 2.1
14
Missing Values
  • imputation (filling-in) of missing data
  • We will use two methods of single imputation:
  • Mean imputation
  • Hot deck imputation

15
Missing Values
  • single imputation
  • mean imputation: uses the mean of the observed
    values of the feature that contains missing data
  • in the case of a symbolic/categorical feature, the
    mode (the most frequent value) is used
  • the algorithm imputes missing values for each
    attribute separately
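
A minimal sketch of mean/mode imputation follows. It assumes missing values are encoded as `None` and that every column has at least one observed value. The function name `impute_mean_mode` and the sample rows are made up for illustration and are not taken from Table 2.1.

```python
from collections import Counter
from statistics import mean


def impute_mean_mode(rows):
    """Fill None entries column by column: mean for numeric, mode otherwise."""
    fills = []
    for column in zip(*rows):
        observed = [v for v in column if v is not None]
        if all(isinstance(v, (int, float)) for v in observed):
            fills.append(mean(observed))  # numeric feature: use the mean
        else:
            # symbolic/categorical feature: use the most frequent value
            fills.append(Counter(observed).most_common(1)[0][0])
    return [[fill if v is None else v for v, fill in zip(row, fills)]
            for row in rows]


# Hypothetical records: one numeric and one categorical feature.
rows = [
    [1.0, "a"],
    [None, "b"],
    [3.0, None],
    [5.0, "b"],
]
print(impute_mean_mode(rows))
```

Each attribute is handled separately, as the slide describes, so the value imputed for one feature never depends on the other features of the same record.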

16
Table 2.2
17
Missing Values
  • single imputation
  • hot deck imputation: for each object that
    contains missing values, the most similar object
    (according to some distance function) is found,
    and the missing values are imputed from that
    object
  • if the most similar record also contains missing
    values for the same feature then it is discarded
    and another closest object is found
  • the procedure is repeated until all the missing
    values are imputed
  • when no similar object is found, the closest
    object with the minimum number of missing values
    is chosen to impute the missing values
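
A simplified sketch of hot deck imputation for numeric features is given below. The slides leave the distance function open ("some distance function"), so the root-mean-square difference over the features observed in both records is an assumption here. The names `distance` and `hot_deck_impute` and the sample rows are also hypothetical. Donors that miss the same feature are skipped, which corresponds to discarding them and taking the next closest object.

```python
import math


def distance(a, b):
    """RMS difference over features observed in both records (assumed metric)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def hot_deck_impute(rows):
    """Fill each None from the closest record that has that feature observed."""
    result = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other records where feature j is observed.
            donors = [r for k, r in enumerate(rows)
                      if k != i and r[j] is not None]
            if donors:
                donor = min(donors, key=lambda r: distance(row, r))
                result[i][j] = donor[j]
    return result


# Hypothetical records with features on comparable scales.
rows = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.2],
    [5.0, 6.0, None],
    [5.2, 6.1, 7.0],
]
print(hot_deck_impute(rows))
```

Donor values are always taken from the original records, never from values imputed earlier. With features on very different scales, the distance would need normalisation first, which this sketch omits.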

18
Table 2.3
19
Noise
20
Noise
  • Def.: Noise in the data is defined as a value
    that is a random error or variance in a measured
    feature
  • the amount of noise in the data can jeopardize
    the results of the entire knowledge discovery
    process (KDP)
  • the influence of noise on the data can be
    prevented by imposing constraints on features to
    detect anomalies when the data is entered
  • for instance, a DBMS usually provides facilities
    to define constraints for individual attributes

21
Noise Detection
  • In manual inspection, the user checks feature
    values against predefined constraints and
    manually detects the noise
  • For example, for object 5 in Table 2.3, the
    cholesterol value is 45.0, which is outside the
    predefined acceptable interval for this feature,
    namely [50.0, 600.0].
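
The constraint check can also be automated. The sketch below flags values outside an acceptable interval. The function name `find_out_of_range` is hypothetical, and the list holds only the four cholesterol values quoted in these slides, so the reported positions are indices into this list, not the object numbers of Table 2.3.

```python
def find_out_of_range(values, low, high):
    """Return (position, value) pairs that violate low <= value <= high."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


cholesterol = [45.0, 261.2, 331.2, 407.5]
print(find_out_of_range(cholesterol, 50.0, 600.0))  # [(0, 45.0)]
```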

22
Noise
  • Noise can be removed using binning
  • Binning requires ordering the values of the noisy
    feature and then substituting the values with a
    mean or median value for predefined bins
  • In Table 2.3, the Cholesterol attribute contains
    the value 45.0, which is noise. As an example,
    consider the cholesterol feature with its values
    45.0, 261.2, 331.2, and 407.5. If the bin size
    equals two, two bins are created: bin1 with 45.0
    and 261.2, and bin2 with 331.2 and 407.5. For
    bin1 the mean value is 153.1, and for bin2 it is
    369.4 (369.35 before rounding). Therefore the
    values 45.0 and 261.2 would be replaced with
    153.1, and the values 331.2 and 407.5 with 369.4.
    Note that the two new values are within the
    acceptable interval.
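
The binning example can be checked with a few lines of Python. This is a sketch that assumes equal-size bins over the sorted values and smoothing by the bin mean. The function name `smooth_by_bin_means` is illustrative.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort the values, group them into bins, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


cholesterol = [45.0, 261.2, 331.2, 407.5]
smoothed = smooth_by_bin_means(cholesterol, bin_size=2)
print([round(v, 2) for v in smoothed])
```

The result is returned in sorted order, so mapping the smoothed values back to the original records would need the original positions to be tracked. That step is left out here.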