Data Preparation Using the Genetic Algorithm

1
Data Preparation Using the Genetic Algorithm
  • June 9, 2003
  • By Hao Lac

2
Outline
  • Background
  • Motivation (Why use the Genetic Algorithm (GA)?)
  • Genetic Algorithm for Attribute Selection
  • Standard Individual Encoding for Attribute
    Selection
  • Attribute Index-Based Individual Encoding for
    Attribute Selection
  • Fitness Evaluation
  • Attribute Weighting
  • Project Proposal

3
Background
  • Garbage in, Garbage out.
  • Improve the quality of the data being mined, to
    facilitate the application of a data mining
    algorithm.
  • The problem: dimension reduction, to reduce
    redundancy and increase computational speed.
  • Two approaches: the wrapper approach and the
    filter approach.

4
Background (Cont)

5
Why Use the GA?
  • Does not assume monotonicity of the measure used
    to evaluate the quality of a candidate attribute
    subset, typically a measure of predictive
    accuracy using the wrapper approach.
  • Designed to cope with multiple criteria for
    evaluating a candidate attribute subset, such as
    predictive accuracy, number of selected
    attributes, etc.

6
Genetic Algorithm for Attribute Selection
  • How to encode candidate solutions (selected
    attribute subsets) into a GA individual?
  • What genetic operators to use for the
    corresponding individual encoding?
  • What fitness functions to use?

7
Standard Individual Encoding for Attribute
Selection
  • Each state in the search space is represented by
    a fixed-length binary string containing m bits.
  • m is the number of available attributes.
  • For example, 0 1 0 1 0 0 0 0 0 1 means that
    attributes A2, A4, and A10 are selected.
  • Conventional genetic operators can be used, but
    they are not the best choice.
  • Chen et al. (1999) and Guerra-Salcedo et al. (1999)
    proposed a heuristic method called
    commonality-based crossover to deal with weakly
    relevant attributes.
  • Attribute weighting is another method used.
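
The standard encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function names are hypothetical, and one-point crossover and bit-flip mutation are assumed here as typical examples of "conventional genetic operators".

```python
import random

def decode_bits(bits):
    """Return the 1-based indices of the selected attributes (A1..Am)."""
    return [i + 1 for i, b in enumerate(bits) if b == 1]

def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def bit_flip_mutation(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

# The example from the slide: m = 10 attributes.
individual = [0, 1, 0, 1, 0, 0, 0, 0, 0, 1]
selected = decode_bits(individual)  # [2, 4, 10], i.e. A2, A4, A10
```

Because every individual has exactly m bits, these operators always produce valid individuals without any repair step.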

8
Attribute Index-Based Individual Encoding for
Attribute Selection
  • Cherkauer and Shavlik (1996) proposed an individual
    representation that consists of N genes, where
    each gene can contain either the index (id) of an
    attribute or a flag (say, 0) indicating no
    attribute.
  • N is a user-specified parameter.
  • For example, consider an individual where N = 5:
    0 A1 A4 0 A1. Here, only attributes A1 and A4
    are selected.
  • The fact that A1 appears twice is irrelevant when
    decoded into a subset of selected attributes.
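
Decoding an index-based individual can be sketched as follows. The representation of genes as plain integers (with 0 as the "no attribute" flag) and the name `decode_indices` are assumptions made for illustration.

```python
NO_ATTRIBUTE = 0  # flag value meaning "this gene selects nothing"

def decode_indices(genes):
    """Return the sorted set of attribute indices present in the genes.

    Repeated indices collapse to one entry, so duplicates do not
    change the decoded attribute subset.
    """
    return sorted({g for g in genes if g != NO_ATTRIBUTE})

# The example from the slide: N = 5, genes 0 A1 A4 0 A1.
individual = [0, 1, 4, 0, 1]
selected = decode_indices(individual)  # [1, 4], i.e. A1 and A4
```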

9
Attribute Index-Based Individual Encoding for
Attribute Selection (Cont)
  • An attribute index appearing more than once
    increases robustness and slows the loss of
    genetic diversity.
  • The individual's length is independent of the
    number of original attributes, which makes this
    approach more scalable to large numbers of
    attributes.
  • Requires a new genetic operator called Delete
    Attribute in addition to the conventional
    operators.

10
Attribute Index-Based Individual Encoding for
Attribute Selection (Cont)
  • Delete Attribute takes as input one parent and
    produces as output one offspring.
  • The offspring is produced by selecting, with
    equal probability, an attribute from the parent
    and deleting all occurrences of the selected
    attribute in the resulting offspring.
  • Delete Attribute favours the selection of smaller
    attribute subsets.
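
A minimal sketch of the Delete Attribute operator follows. Two details are assumptions not stated on the slides: the attribute is chosen uniformly among the distinct attributes present in the parent (rather than among gene positions), and deleted genes are overwritten with the "no attribute" flag so that the offspring keeps length N.

```python
import random

NO_ATTRIBUTE = 0  # flag value meaning "this gene selects nothing"

def delete_attribute(parent, rng):
    """Return one offspring with every occurrence of one attribute removed."""
    present = sorted({g for g in parent if g != NO_ATTRIBUTE})
    if not present:
        return list(parent)  # nothing to delete; return an unchanged copy
    victim = rng.choice(present)  # each distinct attribute is equally likely
    return [NO_ATTRIBUTE if g == victim else g for g in parent]

parent = [0, 1, 4, 0, 1]  # 0 A1 A4 0 A1
child = delete_attribute(parent, random.Random(42))
# child is either [0, 0, 4, 0, 0] (A1 deleted) or [0, 1, 0, 0, 1] (A4 deleted)
```

Since the offspring always decodes to a subset with one fewer attribute than its parent, repeated application pushes the population toward smaller subsets.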

11
Fitness Evaluation
  • The fitness function of a GA for attribute
    selection involves a measure of performance of a
    classification algorithm using only the subset of
    attributes selected by the corresponding GA
    individual.
  • The distinction between wrapper and filter
    approaches gets fuzzy when filter-oriented
    criteria, such as the number of selected
    attributes, are included in the fitness function.
  • The nature of the fitness function used
    determines whether one is selecting attributes
    for the purposes of classification or clustering.
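
A wrapper-style fitness function of the kind described above might look like the sketch below. Everything specific in it is an assumption chosen for illustration: the classifier (1-nearest-neighbour), the accuracy estimate (leave-one-out), the linear penalty on subset size, and the `penalty` weight. Columns are 0-based here, and the individual uses the standard bit-string encoding.

```python
def loo_1nn_accuracy(rows, labels, selected):
    """Leave-one-out accuracy of 1-nearest-neighbour on the selected columns."""
    if not selected:
        return 0.0  # no attributes selected: treat as worst case
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_d = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue  # leave the query row out of the training set
            d = sum((row[k] - other[k]) ** 2 for k in selected)
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)

def fitness(bits, rows, labels, penalty=0.01):
    """Predictive accuracy minus a small penalty per selected attribute."""
    selected = [i for i, b in enumerate(bits) if b]
    return loo_1nn_accuracy(rows, labels, selected) - penalty * len(selected)

# Toy data: column 0 separates the classes, column 1 is misleading noise.
rows = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [1.1, 1.1]]
labels = [0, 0, 1, 1]
good = fitness([1, 0], rows, labels)  # uses only the informative column
bad = fitness([1, 1], rows, labels)   # the noisy column hurts accuracy
```

On this toy data the smaller subset scores higher, which shows how a single fitness value can combine predictive accuracy with the number of selected attributes.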

12
Project Proposal
  • To implement a data preparation system based on
    the Genetic Algorithm using the wrapper approach.
  • To compare index-based individual encoding for
    attribute selection with a traditional attribute
    selection method, using a common fitness
    function.
  • The comparison will be formalized using
    statistical analysis.