Title: Data Warehouses and Data Mining: Classification Algorithms
1Data Warehouses and Data Mining: Classification Algorithms
- dr. Abonyi János
- Veszprémi Egyetem
- www.fmt.vein.hu/softcomp
2Training data
- A collection of records (objects) x. Each record contains a set of features and the class C that it belongs to.
3Predictive Modelling (Classification)
Linear Classifier
Non-Linear Classifier
[Figure: loan applicants plotted in the (income, debt) plane, with a linear and a non-linear decision boundary.]
a·income + b·debt < t => No loan!
4Example
5Predictive Modeling
Goal: learn a mapping y = f(x; θ). Need: 1. A model structure, 2. A score function, 3. An optimization strategy. Categorical y ∈ {c1, …, cm}: classification. Real-valued y: regression. Note: usually assume c1, …, cm are mutually exclusive and exhaustive.
6Classifier Types
- Discrimination: direct mapping from x to {c1, …, cm} (e.g. perceptron, SVM, CART)
- Regression: model p(ck | x) (e.g. logistic regression, CART)
- Class-conditional: model p(x | ck, θk) (e.g. Bayesian classifiers, LDA)
7Evaluation of Classification Systems
Training Set: examples with class values, for learning. Test Set: examples with class values, for evaluating. Evaluation: hypotheses are used to infer the classification of examples in the test set; the inferred classification is compared to the known classification. Accuracy: percentage of examples in the test set that are classified correctly.
8Algorithms for supervised learning
- Neural networks (general non-linear models, adaptivity, artificial brain) (previous lesson)
- Bayes (Linear/Fisher Discriminant Analysis)
- Decision trees (logical rules)
- k-NN (k-Nearest Neighbors) (simple, non-parametric)
- SVM (Support Vector Machines)
9Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
10Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities
- P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C
- E.g. P(class = N | outlook = sunny, windy = true, …)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
11Estimating a-posteriori probabilities
- Bayes theorem:
- P(C|X) = P(X|C) · P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- C such that P(C|X) is maximum = C such that P(X|C) · P(C) is maximum
- Problem: computing P(X|C)
12Naïve Bayesian Classification
- Naïve assumption: attribute independence
- P(x1, …, xk | C) = P(x1|C) · … · P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
- Computationally easy in both cases (a short sketch follows below)
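A minimal Python sketch of the two estimates above, assuming NumPy arrays for the attribute column and the class labels; the helper names estimate_categorical and gaussian_density are illustrative, not from the slides.

```python
import numpy as np

def estimate_categorical(column, labels, value, c):
    """P(xi|C): relative frequency of `value` within class c (categorical attribute)."""
    in_class = column[labels == c]
    return np.mean(in_class == value)

def gaussian_density(column, labels, value, c):
    """P(xi|C): Gaussian density fitted within class c (continuous attribute)."""
    in_class = column[labels == c]
    mu, sigma = in_class.mean(), in_class.std(ddof=1)
    return np.exp(-(value - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
```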
13Play-tennis example: estimating P(xi|C)
14Play-tennis example: classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p) · P(p) = P(rain|p) · P(hot|p) · P(high|p) · P(false|p) · P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- P(X|n) · P(n) = P(rain|n) · P(hot|n) · P(high|n) · P(false|n) · P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Sample X is classified in class n (don't play); the arithmetic is checked in the snippet below
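The two products can be checked directly in plain Python; the fractions are the ones read off the slide.

```python
from math import prod

p_yes = prod([3/9, 2/9, 3/9, 6/9, 9/14])   # P(X|p) * P(p)
p_no  = prod([2/5, 2/5, 4/5, 2/5, 5/14])   # P(X|n) * P(n)
print(round(p_yes, 6), round(p_no, 6))     # 0.010582 0.018286 -> class n
```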
15The independence hypothesis
- makes computation possible
- yields optimal classifiers when satisfied
- but is seldom satisfied in practice, as attributes (variables) are often correlated
- Attempts to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
16Let's try
17Linear Discriminant Analysis
Could model each class density as multivariate
normal
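A minimal sketch of this idea with NumPy, assuming each class density is Gaussian with a shared (pooled) covariance matrix; lda_fit and lda_predict are illustrative names, not a reference implementation.

```python
import numpy as np

def lda_fit(X, y):
    """Per-class means, priors, and the inverse of the pooled covariance matrix."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    pooled = sum((np.sum(y == c) - 1) * np.cov(X[y == c], rowvar=False)
                 for c in classes) / (len(y) - len(classes))
    return classes, means, priors, np.linalg.inv(pooled)

def lda_predict(x, classes, means, priors, cov_inv):
    """Pick the class with the largest linear discriminant score."""
    scores = [x @ cov_inv @ means[c] - 0.5 * means[c] @ cov_inv @ means[c]
              + np.log(priors[c]) for c in classes]
    return classes[int(np.argmax(scores))]
```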
18Example
19A Decision Tree for buys_computer
[Figure: decision tree for buys_computer. The root tests age? with branches <30, 30..40 and >40; the <30 branch tests student?, the >40 branch tests credit rating?, and the 30..40 branch leads directly to a yes leaf; the remaining branches (yes/no, fair/excellent) end in yes/no leaves.]
20Example
21Classification by Decision Tree Induction
- Decision tree:
- A flow-chart-like tree structure
- Internal node denotes a test on an attribute
- Branch represents an outcome of the test
- Leaf nodes represent class labels or class distribution
- Decision tree generation consists of two phases:
- Tree construction
- At start, all the training examples are at the root
- Partition examples recursively based on selected attributes
- Tree pruning
- Identify and remove branches that reflect noise or outliers
- Use of decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree
22Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm; a sketch in code follows this slide):
- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
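A compact sketch of this greedy, top-down procedure for categorical attributes, using the information gain defined on the following slides; rows are assumed to be dicts keyed by attribute name, and build_tree / info_gain are illustrative names.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == v]
        gain -= len(idx) / len(rows) * entropy([labels[i] for i in idx])
    return gain

def build_tree(rows, labels, attrs):
    """Stop on a pure node, or on an empty attribute list (majority vote)."""
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # greedy choice
    tree = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[(best, v)] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return tree
```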
23Attribute Selection Measure
- Information gain (ID3/C4.5)
- All attributes are assumed to be categorical
- Can be modified for continuous-valued attributes
- Gini index (IBM IntelligentMiner)
- All attributes are assumed continuous-valued
- Assume there exist several possible split values for each attribute
- May need other tools, such as clustering, to get the possible split values
- Can be modified for categorical attributes
24Entropy I.
- S is a sample of training examples
- p+ is the proportion of positive (P) examples
- p- is the proportion of negative (N) examples
- Entropy measures the impurity of S
- Entropy(S) = -p+ log2 p+ - p- log2 p-
25Entropy II.
- Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)
- Why?
- Information theory: an optimal-length code assigns -log2 p bits to messages having probability p
- So the expected number of bits to encode (+ or -) of a random member of S is:
- -p+ log2 p+ - p- log2 p-
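As a quick numeric check of this formula in Python: the 9+/5- class distribution that reappears on slide 28 gives I(9, 5) ≈ 0.940.

```python
import math

def binary_entropy(p_pos):
    """-p+ log2 p+ - p- log2 p-, with the convention 0*log2(0) = 0."""
    if p_pos in (0.0, 1.0):
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

print(binary_entropy(9 / 14))   # ~0.940 bits
```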
26Information Gain in Decision Tree Induction
- Assume that using attribute A a set S will be partitioned into sets S1, S2, …, Sv
- If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is E(A) = Σ_{i=1..v} (pi + ni)/(p + n) · I(pi, ni)
- The encoding information that would be gained by branching on A: Gain(A) = I(p, n) - E(A)
27Example of Information Gain
Entropy(29+, 35-) = -29/64 · log2(29/64) - 35/64 · log2(35/64) = 0.99
Split by A2 (51 and 13 examples):
- Entropy(18+, 33-) = 0.94
- Entropy(11+, 2-) = 0.62
- Gain(S, A2) = Entropy(S) - 51/64 · Entropy(18+, 33-) - 13/64 · Entropy(11+, 2-) = 0.12
Split by A1 (26 and 38 examples):
- Entropy(21+, 5-) = 0.71
- Entropy(8+, 30-) = 0.74
- Gain(S, A1) = Entropy(S) - 26/64 · Entropy(21+, 5-) - 38/64 · Entropy(8+, 30-) = 0.27
(These values are re-derived in the snippet below.)
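A small Python check that reproduces the numbers above from the class counts alone; H is a throwaway helper, not a name used in the slides.

```python
import math

def H(p, n):
    """Entropy of a node with p positive and n negative examples."""
    return sum(-f * math.log2(f) for f in (p / (p + n), n / (p + n)) if f > 0)

S = H(29, 35)                                           # ~0.99
gain_A1 = S - 26/64 * H(21, 5) - 38/64 * H(8, 30)       # ~0.27
gain_A2 = S - 51/64 * H(18, 33) - 13/64 * H(11, 2)      # ~0.12
print(round(gain_A1, 2), round(gain_A2, 2))
```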
28Attribute Selection by Information Gain
Computation
- Class P: buys_computer = yes
- Class N: buys_computer = no
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:
29The result
30Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σ_{j=1..n} pj^2, where pj is the relative frequency of class j in T
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as gini_split(T) = N1/N · gini(T1) + N2/N · gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute); see the sketch below
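A minimal sketch of both definitions in Python; gini and gini_split are illustrative names, and the split is assumed binary, as on the slide.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j pj^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted gini index of a split of T into T1 and T2."""
    n1, n2 = len(left), len(right)
    return n1 / (n1 + n2) * gini(left) + n2 / (n1 + n2) * gini(right)

print(gini(["yes"] * 9 + ["no"] * 5))   # ~0.459 for a 9/5 class split
```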
31Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- IF age < 30 AND student = no THEN buys_computer = no
- IF age < 30 AND student = yes THEN buys_computer = yes
- IF age 31..40 THEN buys_computer = yes
- IF age > 40 AND credit_rating = excellent THEN buys_computer = yes
- IF age > 40 AND credit_rating = fair THEN buys_computer = no
32Avoid Overfitting in Classification
- The generated tree may overfit the training data
- Too many branches, some may reflect anomalies due to noise or outliers
- Result is poor accuracy for unseen samples
- Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
- Difficult to choose an appropriate threshold
- Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees
- Use a set of data different from the training data to decide which is the best pruned tree
33Approaches to Determine the Final Tree Size
- Use all the data for training,
- but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle:
- halt growth of the tree when the encoding is minimized
- Use cross validation
34Cross-Validation
- Estimate the accuracy of a hypothesis induced by a supervised learning algorithm
- Predict the accuracy of a hypothesis over future unseen instances
- Select the optimal hypothesis from a given set of alternative hypotheses
- Pruning decision trees
- Model selection
- Feature selection
- Combining multiple classifiers (boosting)
35Cross-Validation
- k-fold cross-validation splits the data set D into k mutually exclusive subsets D1, D2, …, Dk
- Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di
[Figure: 4-fold cross-validation; the folds D1..D4 take turns as the held-out test set while the remaining folds are used for training.]
acc_cv = (1/n) · Σ_{(vi, yi) ∈ D} δ(I(D \ Di, vi), yi)
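A sketch of the k-fold estimate above, assuming a scikit-learn-style model object with fit and predict methods (an assumption; the slides do not name a library).

```python
import numpy as np

def k_fold_accuracy(model, X, y, k=4, seed=0):
    """acc_cv = (1/n) * sum of delta(I(D \\ Di, vi), yi) over every held-out example."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    correct = 0
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])                    # induce I on D \ Di
        correct += np.sum(model.predict(X[folds[i]]) == y[folds[i]])
    return correct / len(y)
```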
36Enhancements to decision tree induction
- Allow for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
- Handle missing attribute values
- Assign the most common value of the attribute
- Assign a probability to each of the possible values
- Attribute construction
- Create new attributes based on existing ones that are sparsely represented
- This reduces fragmentation, repetition, and replication
37Let's try
38Instance-Based Methods
- Instance-based learning:
- Store training examples and delay the processing (lazy evaluation) until a new instance must be classified
- Typical approaches:
- k-nearest neighbor approach
- Instances represented as points in a Euclidean space
- Locally weighted regression
- Constructs a local approximation
39Nearest Neighbor
- Given a distance metric
- Assign class to be the same as its nearest neighbor
- All training data is used during operation
- Multi-class decision framework
40Example
41The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance
- The target function could be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: Voronoi diagram of the decision surface induced by 1-NN, with query point xq among the + and - training points.]
42Discussion on the k-NN Algorithm
- The k-NN algorithm for continuous-valued target functions:
- Calculate the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm:
- Weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors (sketched below)
- Robust to noisy data by averaging over the k nearest neighbors
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
- To overcome it, stretch the axes or eliminate the least relevant attributes
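A minimal NumPy sketch of the distance-weighted vote described above (weighting by 1/d^2 is one common choice; the slides do not fix a particular weighting).

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x_query, k=3, weighted=True, eps=1e-9):
    """Vote among the k nearest neighbors in Euclidean distance, optionally weighted by 1/d^2."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(d)[:k]:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + eps) if weighted else 1.0
    return max(votes, key=votes.get)
```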
43k-NN Algorithm Example I.
44k-NN Algorithm Example II.
45k-NN Algorithm Example III.
46Attributes with Cost
- Consider:
- Medical diagnosis: a blood test costs 1000 USD
- How to learn a consistent tree with low expected cost?
- Replace Gain by:
- Gain^2(S, A) / Cost(A)   [Tan, Schlimmer 1990]
- (2^Gain(S, A) - 1) / (Cost(A) + 1)^w,  w ∈ [0, 1]   [Nunez 1988]
47Linear SVM - Separable Case
Consider the problem of separating the set of training vectors belonging to two separate classes, D = {(x1, y1), …, (xl, yl)}, xi ∈ R^d, yi ∈ {-1, 1}, with the hyperplane w·x + b = 0. The set of vectors is said to be optimally separated by the hyperplane if it is separated without error and the distance between the closest vector and the hyperplane is maximal.
48Linear SVM
- Let d+ (d-) be the shortest distance from the hyperplane to the closest positive (negative) example
- The margin of the hyperplane is defined to be d+ + d-
49- separating hyperplane: w·x + b = 0
- decision function: f(x) = sgn(w·x + b)
50Hence the hyperplane that optimally separates the data is the one that minimizes Φ(w) = (1/2)||w||^2, subject to yi(w·xi + b) ≥ 1, i = 1, …, l.
51- dual problem:
- maximize W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xi · xj)
- subject to αi ≥ 0 and Σi αi yi = 0
According to the Kuhn-Tucker conditions, only the points which satisfy yi(w·xi + b) = 1 will have non-zero Lagrange multipliers. These points are termed Support Vectors (SV).
52[Figure: optimal separating hyperplane w·x + b = 0; the support vectors lie closest to the hyperplane, on the margin.]
53Linear SVM - Non-Separable Case
54Linear SVM - Non-Separable Case
l observations consisting of pairs xi ∈ R^d, i = 1, …, l, and the associated labels yi ∈ {-1, 1}. Introduce positive slack variables ξi and modify the objective function to: minimize (1/2)||w||^2 + C Σi ξi, subject to yi(w·xi + b) ≥ 1 - ξi, ξi ≥ 0. C → ∞ corresponds to the separable case. (A short sketch of fitting such a soft-margin SVM follows below.)
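A minimal sketch of fitting such a soft-margin linear SVM, using scikit-learn as an assumed library (the slides do not name one); the parameter C penalizes the slack variables ξi, and a large C approaches the separable case.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=10.0)        # large C -> close to the separable case
clf.fit(X, y)
print(clf.coef_, clf.intercept_)          # w and b of the hyperplane w.x + b = 0
print(len(clf.support_vectors_))          # points with non-zero Lagrange multipliers
```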
57Non-Linear SVM
60Summary
- What is classification, and how does it differ from clustering?
- What is the essence of the k-NN algorithm?
- Decision trees, information entropy, gain
- Bayesian classification
- SVM
61What is the best model? Accuracy vs. generalization
- Find a model that avoids overfitting: too high accuracy on the training set may result in poor generalization (classification accuracy on new instances of the data).
62How to choose feature space?
[Figure: two candidate feature spaces for separating adults from kids: weight vs. height, and estrogen vs. testosterone.]