1
A Genetic Programming System for Building Block
Analysis to Enhance Data Analysis and Data Mining
Techniques
  • Christoph F. Eick, Walter D. Sanz, and Ruijian
    Zhang
  • www.cs.uh.edu/ceick/matta.html
  • University of Houston
  • Organization
  • 1. Introduction
  • 2. General Approach and Corresponding
    Methodology
  • 3. Wols --- A Genetic Programming System to
    Find Building Blocks
  • 4. Experimental Results
  • 5. Summary and Conclusion

2
1. Introduction
  • Regression analysis, which is also known as curve
    fitting, may be broadly defined as the analysis
    of relationships among variables. The actual
    specification of the form of the function, whether
    it be linear, quadratic, or polynomial, is left
    entirely up to the user.
  • Symbolic regression is the process of discovering
    both the functional form of a target function and
    all of its necessary coefficients. Symbolic
    regression involves finding a mathematical
    expression in symbolic form, that provides a
    good, best or perfect fit between a finite
    sampling of values of the input variables and the
    associated values of the dependent variables.
  • Genetic Programming simulates the Darwinian
    evolutionary process and naturally occurring
    genetic operations on computer programs. Genetic
    programming provides a way to search for the
    fittest computer program in a search space which
    consists of all possible computer programs
    composed of functions and terminals appropriate
    to the problem domain.
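
As a rough illustration of the search space described above, a candidate program can be held as a syntax tree and scored against a finite sample of the target function. The sketch below uses Python rather than the authors' Lisp; the function table `FUNCTIONS`, the helper `evaluate`, and the sample target are assumptions made purely for illustration.

```python
import math
import operator

# Hypothetical function set: symbol -> (arity, implementation).
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
    "cos": (1, math.cos),
}

def evaluate(tree, env):
    """Recursively evaluate a syntax tree given variable bindings in env."""
    if isinstance(tree, tuple):           # internal node: (symbol, child, ...)
        symbol, *children = tree
        arity, fn = FUNCTIONS[symbol]
        assert len(children) == arity
        return fn(*(evaluate(c, env) for c in children))
    if isinstance(tree, str):             # terminal: variable name
        return env[tree]
    return tree                           # terminal: numeric constant

# Candidate expression (- (* x x) 1), i.e. x*x - 1.
candidate = ("-", ("*", "x", "x"), 1.0)

# A finite sample of the (here known) target function x*x - 1.
samples = [(x, x * x - 1.0) for x in (-2.0, 0.0, 3.0)]
total_error = sum(abs(evaluate(candidate, {"x": x}) - y) for x, y in samples)
print(total_error)  # 0.0, a perfect fit on this sample
```

A genetic programming run would generate many such trees from the function and terminal sets and prefer those with a lower error.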

3
Example -- Evolving Syntax Trees with Genetic
Programming
4
Motivating Example
Task: Approximate boundaries between classes with
rectangles.
[Figure: two arrangements of classes C1 and C2, one labeled "Easy!" and one labeled "Difficult!"]
Idea: Transform the representation space so that
the approximation problem becomes simpler (e.g.,
instead of (x,y), just use (x-y)).
Problem: How do we find such representation space transformations?
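
The effect of such a transformation can be sketched with a small, made-up data set. The points and the rule "C1 when x > y" below are assumptions for illustration: in the original (x, y) space the diagonal boundary needs many axis-parallel rectangles, while on the derived attribute x - y a single threshold is enough.

```python
# Hypothetical data: class C1 when x > y, class C2 otherwise.
points = [(0.2, 0.9), (0.8, 0.1), (0.5, 0.4), (0.3, 0.7), (0.9, 0.6), (0.1, 0.3)]
labels = ["C1" if x > y else "C2" for x, y in points]

def classify_transformed(x, y):
    """One threshold on the derived attribute x - y separates the classes."""
    return "C1" if (x - y) > 0 else "C2"

predictions = [classify_transformed(x, y) for x, y in points]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(points)
print(accuracy)  # 1.0
```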
5
Building Blocks
  • Definition: Building blocks are patterns which
    occur with a high frequency in a set of solutions.
  • Example:
  • (- ( C D) ( A B))
  • (- C ( B A))
  • (- ( A B) (COS A))
  • (- ( C (/ E (SQRT D))) ( A B))
  • Leaf Building Block: ( A B)
  • Root Originating Building Block: (- ? ( A B))
  • FUNC1 = ( A B)
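
One way to find a leaf building block like the one above is to count, across a set of solutions, the subtrees whose arguments are all terminals. The sketch below is an assumption-laden illustration, not the WOLS implementation: the operators `+` and `*` are placeholders for the operator symbols that are not shown in the example, and treating them as commutative (so that `(+ B A)` counts as `(+ A B)`) is also an assumption.

```python
from collections import Counter

COMMUTATIVE = {"+", "*"}  # assumption: argument order is ignored for these

def normalize(tree):
    """Sort arguments of commutative operators so (+ B A) equals (+ A B)."""
    if not isinstance(tree, tuple):
        return tree
    op, *args = tree
    args = [normalize(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)

def leaf_blocks(tree):
    """Yield subtrees whose arguments are all terminals."""
    if not isinstance(tree, tuple):
        return
    args = tree[1:]
    if all(not isinstance(a, tuple) for a in args):
        yield tree
    for a in args:
        yield from leaf_blocks(a)

# Hypothetical solutions modeled on the slide; "+" and "*" are placeholders.
solutions = [
    ("-", ("*", "C", "D"), ("+", "A", "B")),
    ("-", "C", ("+", "B", "A")),
    ("-", ("+", "A", "B"), ("cos", "A")),
    ("-", ("*", "C", ("/", "E", ("sqrt", "D"))), ("+", "A", "B")),
]

support = Counter()
for s in solutions:
    # Count each block at most once per solution.
    support.update(set(leaf_blocks(normalize(s))))

block, n = support.most_common(1)[0]
print(block, f"{100 * n / len(solutions):.0f}%")  # ('+', 'A', 'B') 100%
```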

6
Constructive Induction
  • Definition: Constructive induction is a general
    approach for coping with inadequate, redundant,
    and useless attributes found in the original
    data. Constructive induction explores
    various representation spaces. Building block
    analysis and constructive induction are used in
    this research to transform the original data by
    adding new attributes and by dropping useless
    ones.
  • Example
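
A minimal sketch of such a transformation, assuming a data collection stored as a list of rows: a derived attribute is added from a building block and an attribute judged useless is dropped. The attribute names, the choice of the sum of A and B as the building block, and the helper `transform` are all hypothetical.

```python
# Hypothetical data collection: each row maps attribute names to values.
rows = [
    {"A": 1.0, "B": 2.0, "C": 7.0, "target": 3.5},
    {"A": 0.5, "B": 4.0, "C": 1.0, "target": 4.9},
]

def transform(rows, new_attributes, dropped):
    """Add derived attributes, then drop the ones judged useless."""
    out = []
    for row in rows:
        new_row = dict(row)
        for name, fn in new_attributes.items():
            new_row[name] = fn(row)
        for name in dropped:
            new_row.pop(name, None)
        out.append(new_row)
    return out

# Assumed building block: the sum of A and B; assumed useless attribute: C.
transformed = transform(rows, {"ADF1": lambda r: r["A"] + r["B"]}, dropped=["C"])
print(transformed[0])
```

The original rows are left untouched, so the primary learning system can be run on both versions and the results compared.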

7
2. General Approach and Corresponding Methodology
  • Primary Learning System: the one we use for the
    task at hand (e.g. C5.0)
  • Secondary Learning System: is used to find
    building blocks
  • General Steps of the Methodology:
  • 0. Run the primary learning system for the data
    collection. Record prediction performance and
    good solutions.
  • 1. Run the genetic programming/symbolic
    regression/building block analysis system
    (secondary learning system) several times for the
    data collection with a rich function set. Record
    prediction performance and good solutions.
  • 2. Analyze the good solutions found by the
    secondary learning system for the occurrence of
    building blocks.
  • 3. Using the results of step 2, transform the
    data collection by generating new attributes.
  • 4. Redo steps 0-3 for the modified data
    collection, as long as a significant improvement
    occurs.

8
3. Wols --- A Genetic Programming System that
Finds Building Blocks
9
Wols Major Inputs
10
Features of the Wols System
  • The WOLS system is basically a modified version
    of the standard genetic programming model. A
    standard genetic programming model employs crossover
    and mutation operators that evolve solutions
    (here approximation functions), relying on the
    principle of survival of the fittest: better
    solutions reproduce with a higher probability.
  • The WOLS system employs the traditional sub-tree
    swapping crossover operator, but provides a
    quite sophisticated set of 7 mutation operators
    that apply various random changes to solutions.
  • The goal of the WOLS system is to find good
    approximations for an unknown function ƒ, and to
    analyze the decomposition of good approximations
    with the goal of identifying building blocks.
  • The front end of the WOLS system takes a data
    collection DC(A1, ..., An, B) as its input and tries
    to find good approximations h, where
    B = h(A1, ..., An), of the unknown function ƒ with
    respect to a given function set and a given error
    function.
  • The symbolic regression genetic programming
    component (SRGP) of the WOLS system includes an
    interactive front end interface which allows a
    user to easily generate a genetic program that
    employs symbolic regression and is tailor-made for
    the problem he wishes to solve.
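
The sub-tree swapping crossover mentioned above can be sketched as follows. This is a generic textbook version under stated assumptions (trees as nested tuples, uniform choice of crossover points, hypothetical helper names), not the WOLS code, and it leaves out the seven mutation operators, whose details the slides do not give.

```python
import random

def paths(tree, prefix=()):
    """Enumerate the position of every node as a tuple of child indices."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree."""
    if not path:
        return subtree
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], subtree),) + tree[i + 1:]

def crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen subtree between two parents."""
    pa = rng.choice(list(paths(parent_a)))
    pb = rng.choice(list(paths(parent_b)))
    child_a = replace(parent_a, pa, get(parent_b, pb))
    child_b = replace(parent_b, pb, get(parent_a, pa))
    return child_a, child_b

rng = random.Random(0)  # fixed seed so the sketch is repeatable
a = ("-", ("*", "C", "D"), ("+", "A", "B"))
b = ("cos", ("+", "A", 0.5))
child_a, child_b = crossover(a, b, rng)
print(child_a)
print(child_b)
```

Because the two subtrees simply trade places, the total number of nodes across both children equals that of the parents.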

11
Features of the Wols System (cont.)
  • The system run starts by asking the user
    interactively for parameters needed to run the
    symbolic regression system. The information the
    user provides is then used to generate the
    functions for the specific problem he wishes to
    solve. These problem specific functions are then
    merged with a set of general genetic programming
    functions to form a complete problem specific
    genetic program.
  • The SRGP component allows the user to select
    between three different error functions:
  • Manhattan error function (minimizes the absolute
    error in the output variable)
  • Least Square error function (minimizes the
    squared error in the output variable)
  • Classification error function (minimizes the
    number of misclassifications)
  • The WOLS system is capable of solving both
    regression analysis and classification problems.
    In regression analysis problems the goal is to
    minimize the error, while in classification problems
    the goal is to minimize the number of
    misclassifications.
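
The three error functions listed above could look roughly like this. Averaging over the test cases is an assumption for the first two (the slides only say which quantity is minimized), and the function names are hypothetical.

```python
def manhattan_error(predicted, desired):
    """Mean absolute difference between predicted and desired outputs."""
    return sum(abs(p - d) for p, d in zip(predicted, desired)) / len(desired)

def least_square_error(predicted, desired):
    """Mean squared difference between predicted and desired outputs."""
    return sum((p - d) ** 2 for p, d in zip(predicted, desired)) / len(desired)

def classification_error(predicted, desired):
    """Number of cases whose predicted class differs from the desired class."""
    return sum(p != d for p, d in zip(predicted, desired))

predicted = [1.0, 2.5, 4.0]
desired = [1.0, 2.0, 5.0]
print(manhattan_error(predicted, desired))       # 0.5
print(least_square_error(predicted, desired))    # (0 + 0.25 + 1) / 3
print(classification_error(["a", "b", "b"], ["a", "b", "c"]))  # 1
```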

12
Features of the Wols System (cont.)
  • The goal of the decision block analysis tool is
    to identify frequently occurring patterns in sets
    of solutions. In our particular approach,
    building blocks are expressions that involve
    function symbols, variable symbols, as well as
    the ? (don't care) symbol.
  • For example, if DC(A1, A2, A3, B) is a data
    collection, the variable A1, the expression (A2
    A3), and the expression ((sin A1) (A2 ?)) are all
    potential building blocks.
  • In general we are interested in finding patterns
    that occur in about 85% or more of the set of
    analyzed approximations.
  • The identification of building blocks is useful
    for both improving the attribute space and in
    determining which part of the search space has
    been explored.
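
Matching a pattern that contains the ? symbol against a set of approximations can be sketched as below. The solutions, the placeholder operators `+` and `*`, and the helper names are assumptions for illustration; the 0.85 cutoff corresponds to the threshold mentioned above.

```python
def matches(pattern, tree):
    """True if tree matches pattern, where "?" matches any subtree."""
    if pattern == "?":
        return True
    if isinstance(pattern, tuple) and isinstance(tree, tuple):
        return len(pattern) == len(tree) and all(
            matches(p, t) for p, t in zip(pattern, tree)
        )
    return pattern == tree

def occurs_in(pattern, tree):
    """True if the pattern matches the tree or any of its subtrees."""
    if matches(pattern, tree):
        return True
    return isinstance(tree, tuple) and any(occurs_in(pattern, c) for c in tree[1:])

def support(pattern, solutions):
    """Fraction of solutions that contain the pattern at least once."""
    return sum(occurs_in(pattern, s) for s in solutions) / len(solutions)

# Hypothetical approximations over A1..A3; "+" and "*" are placeholder operators.
solutions = [
    ("*", ("sin", "A1"), ("+", "A2", "A3")),
    ("*", ("sin", "A1"), ("+", "A2", 0.5)),
    ("-", ("*", ("sin", "A1"), ("+", "A2", "A1")), "A3"),
    ("+", "A1", "A3"),
]
pattern = ("*", ("sin", "A1"), ("+", "A2", "?"))
print(support(pattern, solutions))           # 0.75
print(support(pattern, solutions) >= 0.85)   # False: below the cutoff
```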

13
Decision Block Analysis Functions
  • Root Analysis: Finds common root patterns among a
    population of solution trees.
  • General Analysis: Finds common patterns anywhere
    in a solution tree.
  • Leaf Analysis: Finds common patterns at the leaf
    nodes of solution trees.
  • Frequency Analysis: Finds the total number of
    times a specific operator, variable, or constant is
    used in the population of solution trees.
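
The frequency analysis can be sketched as a pair of counters: one for total uses of each symbol and one for the number of solutions that contain it. The population below is made up, pooling all numeric constants under a single `CONSTANT` entry is an assumption, and the output wording is only meant to resemble a typical report line.

```python
from collections import Counter

def symbols(tree):
    """Yield every operator, variable, and constant in a solution tree."""
    if isinstance(tree, tuple):
        yield tree[0]
        for child in tree[1:]:
            yield from symbols(child)
    elif isinstance(tree, str):
        yield tree
    else:
        yield "CONSTANT"   # all numeric constants are pooled together

def frequency_analysis(population):
    """Return {symbol: (total uses, percent of solutions using it)}."""
    total, containing = Counter(), Counter()
    for tree in population:
        found = list(symbols(tree))
        total.update(found)
        containing.update(set(found))
    return {s: (total[s], 100.0 * containing[s] / len(population)) for s in total}

# Hypothetical population of three solution trees.
population = [
    ("aminus", ("por", "A", "B"), 0.42),
    ("por", ("por", "B", "A"), "D"),
    ("expa", ("aminus", 0.42, "D")),
]
for symbol, (uses, pct) in sorted(frequency_analysis(population).items()):
    print(f"{symbol} was used {uses} times in {pct:.1f}% of population")
```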

14
4. Experimental Results
  • Experiment 1 - 5-variable problem with 30 points
    of training data
  • Experiment 2 - 5-variable problem with 150
    points of training data
  • Experiment 3 - Simple Linear Regression
  • Experiment 4 - Time for a Hot Object To Cool
  • Experiment 5 - Time for a Hot Object To Cool
    Using Constructive Induction
  • Experiment 6 - Glass Classification
  • Experiment 7 - Learning An Approximation
    Function

15
Experiment 7 - Learning An Approximation Function
  • f(a, b, c, d, e) = b-d-0.44log(a1)
    ab(1-log(2))
  • Random values were plugged into function f in
    order to generate test cases of the form (a, b,
    c, d, e, f(a, b, c, d, e))
  • Training data set: 150 test cases; testing data
    set: 150 test cases
  • Goal was to find a function h(a, b, c, d, e) that
    best approximates function f.
  • Function set: +, *, loga(x) = log(x),
    cosa(x) = cos(x), sqrta(x) = sqrt(x),
    por(a,b) = a+b-ab, diva(a,b) = a/b,
    aminus(a,b) = a-b, expa(x) = e^min(x, 300),
    sin100a(x) = sin(100x)
  • Variable set: a, b, c, d, e
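
A few of the non-standard functions in the set above can be written out as follows. This is a sketch under assumptions: por(a, b) is read as a + b - ab, expa(x) as e raised to min(x, 300), and sin100a(x) as sin(100x); loga, cosa, sqrta, and diva are omitted because the slide does not say how they handle invalid arguments.

```python
import math

def por(a, b):
    """Assumed reading: a + b - a*b (the probabilistic OR for inputs in [0, 1])."""
    return a + b - a * b

def aminus(a, b):
    """Difference of the two arguments, as given on the slide."""
    return a - b

def expa(x):
    """Exponential with the argument capped at 300 to avoid overflow."""
    return math.exp(min(x, 300))

def sin100a(x):
    """Sine of 100 times the argument."""
    return math.sin(100 * x)

print(por(0.5, 0.5))             # 0.75
print(expa(1000) == expa(300))   # True: the cap keeps the result finite
```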

16
Experiment 7 - continued
  • Fitness Function: Average Absolute Manhattan
    Distance, $\frac{1}{n}\sum_{i=1}^{n} |x_i - y_i|$,
    where
  • n = total number of test cases
  • x_i = output generated by the solution found
    through genetic programming
  • y_i = output generated by the known function
    (desired output)
  • RUN 1 - Direct Approach
  • RUN 2 - Using Leaf Building Blocks
  • RUN 3 - Using Leaf Building Blocks and Reduced
    Operator and Variable Sets
  • For each run a population size of 200 was used.
  • Each run was terminated after 100 generations.

17
RUN 1 - Direct Approach
  • Average Absolute Manhattan Distance Error:
  • Training: 0.01102
  • Testing: 0.01332
  • Leaf Building Blocks in the Top 10% of the
    Population:
  • Pattern ( B A) occurred 25 times in 100.0% of
    the solutions.
  • Pattern (AMINUS 0.41999999999999998 D) occurred
    32 times in 95.0% of the solutions.
  • Pattern (EXPA (AMINUS 0.41999999999999998 D))
    occurred 28 times in 95.0% of the solutions.
  • Frequency Analysis in the Top 10% of the
    Population:
  • was used 9 times in 25.0% of population
  • aminus was used 56 times in 100.0% of population
  • was used 176 times in 100.0% of population
  • / was used 75 times in 100.0% of population
  • por was used 111 times in 100.0% of population
  • sqrta was used 63 times in 100.0% of population

18
  • Frequency Analysis (continued):
  • sin100a was used 3 times in 10.0% of population
  • cosa was used 1 time in 5.0% of population
  • loga was used 2 times in 5.0% of population
  • expa was used 33 times in 100.0% of population
  • A was used 76 times in 100.0% of population
  • B was used 132 times in 100.0% of population
  • C was used 9 times in 35.0% of population
  • D was used 57 times in 100.0% of population
  • E was used 21 times in 40.0% of population
  • CONSTANT was used 152 times in 100.0% of
    population
  • Leaf Building Blocks: ( B A), (AMINUS
    0.41999999999999998 D), and (EXPA (AMINUS
    0.41999999999999998 D))
  • Since variables C and E are the only two
    variables not present in 100% of the solutions
    analyzed, they will be eliminated in RUN 3 along
    with the least used operator, cosa. Also notice
    that variables C and E were not actually present
    in function f and were successfully filtered out.

19
RUN 2 - Using Leaf Building Blocks
  • Solution (excerpt):
    ( ADF1 (SQRTA ( (SQRTA (AMINUS (SQRTA (POR B
    (SQRTA (AMINUS ( ( B (POR 0.93000000000000005
    (AMINUS ( ADF1 ADF1) ADF1))) (POR (AMINUS (POR
    (SQRTA (AMINUS ( 0.91000000000000003 (POR A
    (EXPA (AMINUS (POR ADF2 ...
  • Average Absolute Manhattan Distance Error:
  • Training: 0.00886
  • Testing: 0.01056
  • In RUN 2 there is a 19.6% increase in
    accuracy in training and a 20.7% improvement in
    testing over RUN 1.

20
Best Solution Run3
  • Solution:
    (DIVA ADF1
      (EXPA ( (LOGA ADF3)
        (AMINUS ( (DIVA (POR 0.23999999999999999 A)
          (POR B (LOGA (POR ADF1 (POR
            (POR (EXPA A)
              (EXPA (LOGA (POR (DIVA ADF3 B)
                (AMINUS ADF3 B)))))
            ( (EXPA 0.95000000000000007) A))))))
          ADF2)
        ADF1))))
  • where ADF1 = ( B A)

21
RUN 3 - Using Leaf Building Blocks and Reduced
Operator and Variable Sets
  • Average Absolute Manhattan Distance Error:
  • Training: 0.00825
  • Testing: 0.00946
  • In RUN 3 there is a 7.0% increase in
    accuracy in training and a 10.4% improvement in
    testing over RUN 2, and a 25.1% increase in
    accuracy in training and a 29.0% improvement in
    testing over RUN 1.
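
The improvement figures quoted for the three runs appear to be relative reductions in the average absolute error; that reading is an assumption, but it reproduces the reported numbers from the rounded errors on the slides. The one small difference is RUN 3 versus RUN 2 in training, which comes out at about 6.9% rather than 7.0%, presumably because the slide values are rounded.

```python
def improvement(before, after):
    """Percentage reduction in error going from before to after."""
    return 100.0 * (before - after) / before

# Average absolute Manhattan distance errors as reported for each run.
errors = {
    "RUN 1": {"training": 0.01102, "testing": 0.01332},
    "RUN 2": {"training": 0.00886, "testing": 0.01056},
    "RUN 3": {"training": 0.00825, "testing": 0.00946},
}

for base, new in [("RUN 1", "RUN 2"), ("RUN 2", "RUN 3"), ("RUN 1", "RUN 3")]:
    for split in ("training", "testing"):
        pct = improvement(errors[base][split], errors[new][split])
        print(f"{new} vs {base}, {split}: {pct:.1f}%")
```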

22
5. Summary and Conclusion
  • The WOLS system is a tool whose objective is to
    find good approximations for an unknown function
    ƒ, and to analyze the decomposition of good
    approximations with the goal of identifying building
    blocks.
  • Our experiments demonstrated that significant
    improvements in the predictive accuracy of the
    function approximations obtained can be achieved
    by employing building block analysis and
    constructive induction. However, more experiments
    are needed to demonstrate the usefulness of the
    WOLS system.
  • The symbolic regression/genetic programming
    front-end (SRGP) did quite well for a number of
    regression system benchmarks. This interesting
    observation has to be investigated in more detail
    in the future.
  • The WOLS system was written in GCL Common LISP
    Version 2.2. All experiments were run on an
    Intel 266 MHz Pentium II with 96 MB RAM.

23
Future Work
  • Improve the WOLS system's performance in the
    area of classification problems.
  • Conduct more experiments that use the WOLS
    system as a secondary learning system, whose goal
    is to search for good building blocks useful for
    improving the attribute space for other inductive
    learning methods such as C4.5.
  • Construct a better user interface, such as a
    graphical user interface.
  • Although the empirical results obtained from the
    large number of experiments involving regression
    were encouraging, more work still needs to be done
    in this area as well.

24
Experiment 3 - Simple Linear Regression
  • In this experiment the WOLS system, NLREG, and
    Matlab were all used to obtain an approximation
    function for a simple linear regression problem.
  • NLREG, like most conventional regression analysis
    packages, is only capable of finding the numeric
    coefficients for a function whose form (i.e.
    linear, quadratic, or polynomial) has been
    prespecified by the user. A poor choice of the
    function's form by the user will in most
    cases lead to a very poor solution which would
    not be truly representative of the program's
    ability to solve the problem. In order to avoid
    this problem, for this experiment both the data
    set and the actual functional form from one of
    the example problems provided with this software
    were used.
  • In Matlab there are basically two different
    approaches available to perform regression
    analysis. The quickest method, which was used in
    this experiment, is to use the command polyfit(x,
    y, n), which instructs Matlab to fit an nth order
    polynomial to the data. The second method is
    similar to NLREG and requires the functional form
    to be specified by the user.
  • Training data set: 30 test cases; testing data
    set: 15 test cases
  • The following parameters were used for the WOLS
    system:
  • Population Size: 500
  • Generations: 200
  • Function set: +, *, loga(x) = log(x),
    cosa(x) = cos(x), sqrta(x) = sqrt(x),
    por(a,b) = a+b-ab, diva(a,b) = a/b,
    aminus(a,b) = a-b, expa(x) = e^min(x, 300),
    sin100a(x) = sin(100x)
  • Variable set: a

25
  • NLREG Solution
  • Title "Linear equation: y = p0 + p1*x"
  • Variables x,y // Two variables: x and y
  • Parameters p0,p1 // Two parameters to be estimated: p0 and p1
  • Function y = p0 + p1*x // Model of the function
  • Stopped due to: Both parameter and relative
    function convergence.
  • Proportion of variance explained (R^2) = 0.9708
    (97.08%)
  • Adjusted coefficient of multiple determination
    (Ra^2) = 0.9698 (96.98%)
  • ---- Descriptive Statistics for Variables ----
    Variable    Minimum value   Maximum value   Mean value   Standard dev.
    ----------  --------------  --------------  -----------  -------------
    x           1               8               4.538095     2.120468
    y           6.3574          22.5456         14.40188     4.390238
  • ---- Calculated Parameter Values ----

26
  • Matlab Solution
  • 1st order polynomial (n = 1)
  • y = 2.0400x + 5.1442
  • Training: 0.6361
  • Testing: 0.7438
  • 5th order polynomial (n = 5)
  • y = 0.0070x^5 - 0.1580x^4 + 1.3365x^3 - 5.2375x^2
    + 11.5018x - 1.0632
  • Training: 0.5838
  • Testing: 0.8034
  • 10th order polynomial (n = 10)
  • y = 0.0004x^10 - 0.0187x^9 + 0.3439x^8 - 3.5856x^7
    + 23.2763x^6 - 97.5702x^5 + 265.0925x^4
    - 456.1580x^3 + 469.7099x^2 - 254.6558x
    + 59.9535
  • Training: 0.4700
  • Testing: 0.8669
  • In viewing the results above, one of the first
    things that is noticed is that as the order of
    the polynomial increases, the training
    performance also increases, but the testing
    performance decreases. This is because the higher
    order polynomials tend to emulate the training set
    data too closely, which in turn causes a poorer
    performance when the testing data set values are
    applied.
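
The trend described above can be read directly off the reported errors. The short sketch below only tabulates the numbers from the slide (lower error means better performance) and checks that the training error falls while the testing error rises with the polynomial order; it does not refit the polynomials, since the data set itself is not given here.

```python
# Average absolute errors reported on the slide: order -> (training, testing).
results = {1: (0.6361, 0.7438), 5: (0.5838, 0.8034), 10: (0.4700, 0.8669)}

for order, (train, test) in sorted(results.items()):
    gap = test - train   # how much worse the fit is on unseen data
    print(f"order {order:2d}: train={train:.4f} test={test:.4f} gap={gap:.4f}")

orders = sorted(results)
train_falls = all(results[a][0] > results[b][0] for a, b in zip(orders, orders[1:]))
test_rises = all(results[a][1] < results[b][1] for a, b in zip(orders, orders[1:]))
print(train_falls and test_rises)   # True
```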

27
WOLS Solution
  • ( (POR ( X X) 0.080000000000000002)
    ( ( (DIVA 0.95000000000000007 0.38)
    ( (DIVA (LOGA ( (POR (EXPA (POR 0.94000000000000006
    (LOGA (COSA ( (DIVA ( (EXPA (DIVA ( (EXPA
    ( 0.68000000000000005 0.13)) 0.51000000000000001)
    0.38)) 0.51000000000000001) 0.38) X)))))
  • Average Absolute Manhattan Distance:
  • Training: 0.3788
  • Testing: 0.6906
  • Runtime-wise, the WOLS system cannot compete with
    NLREG or Matlab, each of which found a solution in
    around 1 second, while the WOLS system spent
    nearly 4 hours generating its solution.
  • The WOLS system produced a solution that had 40%
    better training performance and 7% better
    testing performance than both the NLREG and
    Matlab solutions.
  • This solution was obtained without using building
    block analysis and constructive induction.