Title: Friday, February 2, 2001
1 Presentation
Aspects Of Feature Selection for KDD
Friday, February 2, 2001
Presenter: Ajay Gavade
Paper 2: Liu and Motoda, Chapter 3
2 Outline
- Categories of Feature Selection Algorithms
- Feature Ranking Algorithms
- Minimum Subset Algorithms
- Basic Feature Generation Schemes
- How do we generate subsets?
- Forward, backward, bidirectional, random
- Search Strategies
- How do we systematically search for a good
subset?
- Informed / Uninformed Search
- Complete search
- Heuristic search
- Nondeterministic search
- Evaluation Measure
- How do we tell how good a candidate subset is?
- Information gain, Entropy.
3 The Major Aspects Of Feature Selection
- Search Directions (Feature Subset Generation)
- Search Strategies
- Evaluation Measures
- A particular method of feature selection is a
combination of some possibilities of every
aspect. Hence each method can be represented by a
point in this 3-D structure.
4 Major Categories of Feature Selection Algorithms
(From The Point Of View Of The Method's Output)
- Feature Ranking Algorithms
- These algorithms return a ranked list of features
ordered according to some evaluation measure. The
algorithm tells the importance (relevance) of a
feature compared to others.
5 Major Categories of Feature Selection Algorithms
(From The Point Of View Of The Method's Output)
- Minimum Subset Algorithms
- These algorithms return a minimum feature subset,
and no distinction is made among the features in
the subset. These algorithms are used when we don't
know the number of relevant features.
6 Basic Feature Generation Schemes
- Sequential Forward Generation
- Starts with the empty set and adds features from
the original set sequentially. Features are added
according to relevance.
- The one-step look-ahead form is the most commonly
used scheme because of its good efficiency.
- A minimum feature subset or a ranked list can be
obtained (sketched below).
- Can deal with noise in the data.
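A minimal Python sketch of sequential forward generation, assuming a caller-supplied `evaluate` function that scores a candidate subset (higher is better); the function and variable names are illustrative, not from the chapter.

```python
def sfg(features, evaluate):
    """Sequential Forward Generation: grow a subset one feature at a time."""
    selected, ranked = set(), []
    best_score = float("-inf")
    while len(selected) < len(features):
        # One-step look-ahead: try each remaining feature, keep the best.
        candidate = max(set(features) - selected,
                        key=lambda f: evaluate(selected | {f}))
        score = evaluate(selected | {candidate})
        if score <= best_score:
            break                     # no remaining feature improves the subset
        selected.add(candidate)
        ranked.append(candidate)      # order of addition doubles as a ranking
        best_score = score
    return selected, ranked

# Toy measure: reward covering {"a", "c"}, lightly penalize subset size.
subset, ranking = sfg(["a", "b", "c"],
                      lambda s: len(s & {"a", "c"}) - 0.1 * len(s))
```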
7 Basic Feature Generation Schemes
- Sequential Backward Generation
- Starts with the full set and removes one feature
at a time sequentially. The least relevant feature
is removed at each step.
- But this tells nothing about the ranking of the
relevant features remaining.
- Doesn't guarantee an absolutely minimal subset.
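A matching sketch of sequential backward generation, under the same assumed `evaluate` interface; note that it returns only a subset, with no ranking of the surviving features.

```python
def sbg(features, evaluate):
    """Sequential Backward Generation: drop the least relevant feature each pass."""
    current = set(features)
    best_score = evaluate(current)
    while len(current) > 1:
        # Find the feature whose removal hurts the score the least.
        drop = max(current, key=lambda f: evaluate(current - {f}))
        score = evaluate(current - {drop})
        if score < best_score:
            break                     # every removal now degrades the subset
        current.remove(drop)
        best_score = score
    return current                    # a subset, but no ranking of its members
```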
8 Basic Feature Generation Schemes
- Bidirectional Generation
- This runs SFG and SBG in parallel, and stops when
one algorithm finds a satisfactory subset (a
lockstep sketch follows).
- Optimizes the speed if the number of relevant
features is unknown.
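One way to sketch the bidirectional scheme is to interleave one SFG step and one SBG step, stopping as soon as either direction satisfies a caller-supplied `good_enough` test; the names and the strict lockstep schedule are assumptions for illustration.

```python
def bidirectional(features, evaluate, good_enough):
    """Run SFG and SBG in lockstep; stop when either finds a good subset."""
    forward, backward = set(), set(features)
    for _ in range(len(features)):
        # Forward step: add the single best remaining feature.
        add = max(set(features) - forward, key=lambda f: evaluate(forward | {f}))
        forward |= {add}
        if good_enough(forward):
            return forward
        # Backward step: drop the feature whose removal costs the least.
        drop = max(backward, key=lambda f: evaluate(backward - {f}))
        backward -= {drop}
        if good_enough(backward):
            return backward
    return forward   # fall back if neither direction converged early
```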
9 Basic Feature Generation Schemes
- Sequential generation algorithms are fast on
average, but they can't guarantee an absolutely
minimal valid set, i.e., the optimal feature
subset: if they hit a local minimum (the best
subset at the moment), they have no way to get
out.
- The Random Generation scheme produces subsets at
random. A good random number generator is
required so that ideally every combination of
features has a chance to occur, and occurs just
once (sketched below).
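A small sketch of random subset generation; the `n_trials` cap and `seed` are illustrative. Each feature is kept with probability 1/2, so every one of the 2^n combinations can occur, and a `seen` set keeps any combination from occurring twice.

```python
import random

def random_subsets(features, n_trials, seed=0):
    """Yield distinct random feature subsets (n_trials must be <= 2**n)."""
    rng = random.Random(seed)          # a good generator matters here
    seen = set()
    while len(seen) < n_trials:
        # Include each feature independently with probability 1/2.
        mask = tuple(rng.random() < 0.5 for _ in features)
        if mask not in seen:           # each combination occurs just once
            seen.add(mask)
            yield {f for f, keep in zip(features, mask) if keep}
```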
10 Search Strategies
- Exhaustive search is complete since it covers all
combinations of features. But a complete search
may not be exhaustive.
- Depth-First Search
- This search goes down one branch entirely, and
then backtracks to another branch. It uses a stack
data structure, explicit or implicit (a sketch
follows the figure below).
11 Depth-First Search
[Figure: depth-first traversal of the subset lattice for 3 features a, b, c,
covering {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}]
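A sketch of the depth-first enumeration the figure depicts, using an explicit stack as the slide notes; generating subsets in this order is the point, so no evaluation function appears.

```python
def dfs_subsets(features):
    """Enumerate all nonempty subsets depth-first, with an explicit stack."""
    stack = [(set(), 0)]               # (subset built so far, next index to try)
    while stack:
        subset, i = stack.pop()
        if subset:
            yield subset
        # Push extensions in reverse so the leftmost branch is explored first,
        # going down one branch entirely before backtracking to the next.
        for j in range(len(features) - 1, i - 1, -1):
            stack.append((subset | {features[j]}, j + 1))

# For ["a", "b", "c"]: {a}, {a,b}, {a,b,c}, {a,c}, {b}, {b,c}, {c}
```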
12 Search Strategies
- Breadth-First Search
- This search moves down layer by layer, checking
all subsets with one feature, then with two
features, and so on. It uses a queue data
structure (a sketch follows the figure below).
- Space complexity makes it impractical in most
cases.
13 Breadth-First Search
[Figure: breadth-first traversal of the subset lattice for 3 features a, b, c,
one layer at a time: {a}, {b}, {c}, then {a,b}, {a,c}, {b,c}, then {a,b,c}]
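The breadth-first counterpart, with the queue made explicit; note that the queue can hold an entire layer at once, which is exactly the space cost the slide warns about.

```python
from collections import deque

def bfs_subsets(features):
    """Enumerate all nonempty subsets layer by layer, with an explicit queue."""
    queue = deque((frozenset({f}), i + 1) for i, f in enumerate(features))
    while queue:
        subset, i = queue.popleft()    # FIFO order gives the layer-by-layer sweep
        yield set(subset)
        for j in range(i, len(features)):
            queue.append((subset | {features[j]}, j + 1))

# For ["a", "b", "c"]: {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}
```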
14 Search Strategies
- Branch & Bound Search
- It is a variation of depth-first search, hence it
is an exhaustive search.
- If the evaluation measure is monotonic, this
search is complete and guarantees the optimal
subset (a code sketch follows the figure below).
15 Branch & Bound Search
[Figure: branch & bound search over the subset lattice for 3 features a, b, c,
with bound beta; nodes are annotated with their evaluation values, and branches
whose values exceed the bound are pruned]
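A minimal branch & bound sketch. It assumes a monotonic merit measure `J` (a subset never scores above its superset) and searches for the best subset of a target size `d` by deleting features; `delta` anticipates the approximate variant on a later slide (0 gives the exact search). The interface is an assumption, not the book's code.

```python
def branch_and_bound(features, J, d, delta=0.0):
    """Find the size-d subset maximizing a monotonic measure J.

    Monotonicity (J(subset) <= J(superset)) justifies pruning: once a
    partial subset scores below the bound beta, nothing beneath it can win.
    """
    best = {"subset": None, "beta": float("-inf")}

    def expand(subset, start):
        if len(subset) == d:
            if J(subset) > best["beta"]:
                best["subset"], best["beta"] = set(subset), J(subset)
            return
        for i in range(start, len(features)):
            child = subset - {features[i]}
            if len(child) < d:
                continue
            if J(child) + delta < best["beta"]:
                continue               # bound exceeded: prune this branch
            expand(child, i + 1)

    expand(set(features), 0)
    return best["subset"], best["beta"]

# Toy monotonic measure: each feature adds a fixed nonnegative weight.
weights = {"a": 9, "b": 5, "c": 3}
subset, beta = branch_and_bound(["a", "b", "c"],
                                lambda s: sum(weights[f] for f in s), d=2)
```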
16 Heuristic Search
- Quick To Find a Solution (Subset of Features)
- Finds a Near-Optimal Solution
- More Speed With Little Loss of Optimality
- Best-First Search
- This is derived from breadth-first search. It
expands the search space layer by layer, and
chooses one best subset at each layer to expand
(a sketch follows the figure below).
17 Best-First Search
[Figure: best-first search over the subset lattice for 3 features a, b, c;
nodes are annotated with their evaluation values, and the best-scoring open
node is expanded at each step]
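A best-first sketch using a priority queue: the most promising open subset is always expanded next. The `evaluate` interface, the expansion cap, and the sorted-tuple tiebreaker are illustrative assumptions.

```python
import heapq

def best_first(features, evaluate, max_expansions=50):
    """Best-first search: always expand the highest-scoring open subset."""
    start = frozenset()
    heap = [(-evaluate(start), tuple(sorted(start)), start)]  # max-heap via negation
    seen, best, best_score = {start}, start, evaluate(start)
    for _ in range(max_expansions):
        if not heap:
            break
        neg, _, subset = heapq.heappop(heap)   # most promising node so far
        if -neg > best_score:
            best, best_score = subset, -neg
        for f in set(features) - subset:       # its children: the next layer
            child = frozenset(subset | {f})
            if child not in seen:
                seen.add(child)
                heapq.heappush(heap, (-evaluate(child), tuple(sorted(child)), child))
    return set(best), best_score
```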
18 Search Strategies
- Approximate Branch & Bound Search
- This is an extension of Branch & Bound Search.
- Here the bound is relaxed by some amount δ, which
allows the algorithm to continue past near-misses
and reach the optimal subset. By changing δ, the
monotonicity of the measure can be observed.
19 Approximate Branch & Bound Search
[Figure: approximate branch & bound search for 3 features a, b, c; as in the
branch & bound figure, but with the bound relaxed by δ]
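Under the assumptions of the branch & bound sketch above, only the pruning test changes: a hypothetical call with a relaxed bound, reusing the toy `weights` measure, might look like this.

```python
# Branches scoring within delta of the bound beta are still explored, letting
# the search continue past near-misses; delta = 0 recovers exact B&B.
subset, beta = branch_and_bound(["a", "b", "c"],
                                lambda s: sum(weights[f] for f in s),
                                d=2, delta=1.0)
```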
20 Nondeterministic Search
- Avoid Getting Stuck in Local Minima
- Capture The Interdependence of Features
- It keeps only the current best subset.
- If a sufficiently long running period is allowed
and a good random function is used, it can find
the optimal subset. The problem with this
algorithm is that we don't know when we have
reached the optimal subset. Hence the stopping
condition is the maximum number of loops allowed
(sketched below).
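A sketch of such a nondeterministic search in the spirit of the slide, keeping only the current best subset and stopping after `max_loops` iterations; all names are illustrative.

```python
import random

def nondeterministic_search(features, evaluate, max_loops=1000, seed=0):
    """Randomly probe subsets, keeping only the current best."""
    rng = random.Random(seed)
    best, best_score = set(features), evaluate(set(features))
    for _ in range(max_loops):         # we can't detect optimality, so cap loops
        subset = {f for f in features if rng.random() < 0.5}
        if not subset:
            continue
        score = evaluate(subset)
        # Prefer higher scores; break ties toward smaller subsets.
        if score > best_score or (score == best_score and len(subset) < len(best)):
            best, best_score = set(subset), score
    return best, best_score
```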
21 Evaluation Measures
- What is Entropy?
- A Measure of Uncertainty
- The Quantity
- Purity: how close a set of instances is to having
just one label
- Impurity (disorder): how close it is to total
uncertainty over the labels
- The Measure: Entropy
- Directly proportional to impurity, uncertainty,
irregularity, surprise
- Inversely proportional to purity, certainty,
regularity, redundancy
- Example
- For simplicity, assume y ∈ {0, 1}, distributed
according to Pr(y)
- Can have (more than 2) discrete class labels
- Continuous random variables: differential entropy
- Optimal purity for y: either
- Pr(y = 0) = 1, Pr(y = 1) = 0
- Pr(y = 1) = 1, Pr(y = 0) = 0
- Entropy is 0 if all members of S belong to the
same class.
22 Entropy: Information-Theoretic Definition
- Components
- D: a set of examples <x1, c(x1)>, <x2, c(x2)>,
..., <xm, c(xm)>
- p+ = Pr(c(x) = +), p− = Pr(c(x) = −)
- Definition
- H is defined over a probability density function p
- D contains examples whose frequencies of + and −
labels indicate p+ and p− for the observed data
- The entropy of D relative to c is
H(D) ≡ −p+ logb(p+) − p− logb(p−)
- If a target attribute can take on c different
values, the entropy of S relative to this c-wise
classification is defined as
H(S) = Σ (i = 1..c) −pi logb(pi)
- where pi is the proportion of S belonging to
class i.
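The definition translates directly into code; a small sketch (function name assumed) that reproduces the boundary cases discussed on these slides:

```python
import math

def entropy(labels, base=2):
    """H(D) = -sum_i p_i * log_b(p_i), where p_i is class i's proportion in D."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    # Classes absent from D contribute nothing (0 * log 0 is taken as 0).
    h = -sum((k / n) * math.log(k / n, base) for k in counts.values())
    return h + 0.0  # + 0.0 normalizes -0.0 for the perfectly pure case

print(entropy(["+", "+", "+"]))          # 0.0: all members in the same class
print(entropy(["+", "-", "+", "-"]))     # 1.0: maximum impurity (even split)
print(entropy(["+"] * 8 + ["-"] * 2))    # ~0.722: p+ = 0.8 needs < 1 bit each
```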
23 Entropy
- What is the least pure probability distribution?
- Pr(y = 0) = 0.5, Pr(y = 1) = 0.5
- Corresponds to maximum impurity / uncertainty /
irregularity / surprise
- Property of entropy: it is a concave function
(concave downward)
- Entropy is 1 when S contains equal numbers of
positive and negative examples.
- Entropy specifies the minimum number of bits of
information needed to encode the classification
of an arbitrary member of S.
- What Units is H Measured In?
- Depends on the base b of the log (bits for b = 2,
nats for b = e, etc.)
- A single bit is required to encode each example
in the worst case (p = 0.5)
- If there is less uncertainty (e.g., p = 0.8), we
can use less than 1 bit each
24 Information Gain
- It is a measure of the effectiveness of an
attribute in classifying the training data.
- Measures the expected reduction in entropy caused
by partitioning the examples according to the
attribute.
- Measures the uncertainty removed by splitting on
the value of attribute A.
- The information gain, Gain(S, A), of an attribute
A relative to a collection of examples S is
Gain(S, A) = H(S) − Σ (v ∈ values(A)) (|Sv| / |S|) H(Sv)
- where values(A) is the set of all possible values
of A, and Sv is the subset of S for which A has
value v.
- Gain(S, A) is the information provided about the
target function value, given the value of some
attribute A.
- The value of Gain(S, A) is the number of bits
saved when encoding the target value of an
arbitrary member of S, by knowing the value of
attribute A.
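A sketch of Gain(S, A) building on the `entropy` function above; the dict-based example format and attribute names are made up for illustration.

```python
def information_gain(examples, attribute, label="label"):
    """Gain(S, A) = H(S) - sum over v in values(A) of |Sv|/|S| * H(Sv)."""
    n = len(examples)
    gain = entropy([x[label] for x in examples])
    for v in {x[attribute] for x in examples}:
        s_v = [x[label] for x in examples if x[attribute] == v]
        gain -= (len(s_v) / n) * entropy(s_v)
    return gain

# Toy data: 'wind' splits the labels cleanly, so it removes all uncertainty.
S = [{"wind": "weak",   "humid": "high", "label": "+"},
     {"wind": "weak",   "humid": "low",  "label": "+"},
     {"wind": "strong", "humid": "high", "label": "-"},
     {"wind": "strong", "humid": "low",  "label": "-"}]
print(information_gain(S, "wind"))    # 1.0: one full bit saved per example
print(information_gain(S, "humid"))   # 0.0: the split removes no uncertainty
```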
25 An Illustrative Example
26 Attributes with Many Values
27 Summary Points
- Search & Measure
- Search and measure play the dominant roles in
feature selection.
- Stopping criteria are usually determined by a
particular combination of search & measure.
- There are different feature selection methods
with different combinations of search &
evaluation measures.
- Heuristic Search, Inductive Bias, Inductive
Generalization
- Entropy and Information Gain
- Goal: to measure the uncertainty removed by
splitting on a candidate attribute A
- Calculating information gain (change in entropy)
- Using information gain in the construction of a
decision tree