Title: Data Warehouses and Data Mining: Classification Algorithms
1Data Warehouses and Data Mining: Classification Algorithms
- dr. Abonyi János
- Veszprémi Egyetem
- www.fmt.vein.hu/softcomp
2Training data
- A collection of records (objects) x. Each record contains a set of features and the class C that it belongs to.
3Predictive Modelling (Classification)
Linear Classifier
Non-Linear Classifier
[Figure: loan applicants plotted in the (income, debt) plane, with a linear and a non-linear decision boundary.]
a·income + b·debt < t => No loan!
4Example
5Predictive Modeling
Goal: learn a mapping y = f(x; θ). Need: 1. A model structure, 2. A score function, 3. An optimization strategy. Categorical y ∈ {c1, …, cm}: classification. Real-valued y: regression. Note: usually assume c1, …, cm are mutually exclusive and exhaustive.
6Classifier Types
- Discrimination: direct mapping from x to {c1, …, cm} (e.g. perceptron, SVM, CART)
- Regression: model p(ck | x) (e.g. logistic regression, CART)
- Class-conditional: model p(x | ck, θk) (e.g. Bayesian classifiers, LDA)
7Evaluation of Classification Systems
Training Set: examples with class values, for learning. Test Set: examples with class values, for evaluating. Evaluation: hypotheses are used to infer the classification of examples in the test set; the inferred classification is compared to the known classification. Accuracy: percentage of examples in the test set that are classified correctly.
8Algorithms for supervised learning
- Neural networks (general non-linear models, adaptivity, artificial brain) (previous lesson)
- Bayes (Linear/Fisher Discriminant Analysis)
- Decision trees (logical rules)
- k-NN (k-Nearest Neighbors) (simple, non-parametric)
- SVM (Support Vector Machines)
9Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
10Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities
- P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C
- E.g. P(class = N | outlook = sunny, windy = true, …)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
11Estimating a-posteriori probabilities
- Bayes theorem:
- P(C|X) = P(X|C) · P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- C such that P(C|X) is maximum = C such that P(X|C) · P(C) is maximum
- Problem: computing P(X|C)
12Naïve Bayesian Classification
- Naïve assumption: attribute independence
- P(x1, …, xk | C) = P(x1|C) · … · P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
- Computationally easy in both cases (a short sketch follows below)
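A minimal Python sketch of the two estimates above, assuming NumPy arrays for the attribute column and the class labels; the helper names estimate_categorical and gaussian_density are illustrative, not from the slides.

```python
import numpy as np

def estimate_categorical(column, labels, value, c):
    """P(xi|C): relative frequency of `value` within class c (categorical attribute)."""
    in_class = column[labels == c]
    return np.mean(in_class == value)

def gaussian_density(column, labels, value, c):
    """P(xi|C): Gaussian density fitted within class c (continuous attribute)."""
    in_class = column[labels == c]
    mu, sigma = in_class.mean(), in_class.std(ddof=1)
    return np.exp(-(value - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
```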
13Play-tennis example: estimating P(xi|C)
14Play-tennis example: classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p) · P(p) = P(rain|p) · P(hot|p) · P(high|p) · P(false|p) · P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- P(X|n) · P(n) = P(rain|n) · P(hot|n) · P(high|n) · P(false|n) · P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Sample X is classified in class n (don't play); the arithmetic is checked in the snippet below
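The two products can be checked directly in plain Python; the fractions are the ones read off the slide.

```python
from math import prod

p_yes = prod([3/9, 2/9, 3/9, 6/9, 9/14])   # P(X|p) * P(p)
p_no  = prod([2/5, 2/5, 4/5, 2/5, 5/14])   # P(X|n) * P(n)
print(round(p_yes, 6), round(p_no, 6))     # 0.010582 0.018286 -> class n
```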
15The independence hypothesis
- makes computation possible
- yields optimal classifiers when satisfied
- but is seldom satisfied in practice, as attributes (variables) are often correlated
- Attempts to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
16Let's try
17Linear Discriminant Analysis
Could model each class density as multivariate
normal
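A minimal sketch of this idea with NumPy, assuming each class density is Gaussian with a shared (pooled) covariance matrix; lda_fit and lda_predict are illustrative names, not a reference implementation.

```python
import numpy as np

def lda_fit(X, y):
    """Per-class means, priors, and the inverse of the pooled covariance matrix."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    pooled = sum((np.sum(y == c) - 1) * np.cov(X[y == c], rowvar=False)
                 for c in classes) / (len(y) - len(classes))
    return classes, means, priors, np.linalg.inv(pooled)

def lda_predict(x, classes, means, priors, cov_inv):
    """Pick the class with the largest linear discriminant score."""
    scores = [x @ cov_inv @ means[c] - 0.5 * means[c] @ cov_inv @ means[c]
              + np.log(priors[c]) for c in classes]
    return classes[int(np.argmax(scores))]
```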
18Example
19A Decision Tree for buys_computer
[Figure: decision tree for buys_computer. The root tests age? with branches <30, 30..40 and >40; the <30 branch tests student?, the >40 branch tests credit rating?, and the 30..40 branch leads directly to a yes leaf; the remaining branches (yes/no, fair/excellent) end in yes/no leaves.]
20Example
21Classification by Decision Tree Induction
- Decision tree:
- A flow-chart-like tree structure
- Internal node denotes a test on an attribute
- Branch represents an outcome of the test
- Leaf nodes represent class labels or class distribution
- Decision tree generation consists of two phases:
- Tree construction
- At start, all the training examples are at the root
- Partition examples recursively based on selected attributes
- Tree pruning
- Identify and remove branches that reflect noise or outliers
- Use of decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree
22Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm; a sketch in code follows this slide):
- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
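A compact sketch of this greedy, top-down procedure for categorical attributes, using the information gain defined on the following slides; rows are assumed to be dicts keyed by attribute name, and build_tree / info_gain are illustrative names.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == v]
        gain -= len(idx) / len(rows) * entropy([labels[i] for i in idx])
    return gain

def build_tree(rows, labels, attrs):
    """Stop on a pure node, or on an empty attribute list (majority vote)."""
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # greedy choice
    tree = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[(best, v)] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return tree
```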
23Attribute Selection Measure
- Information gain (ID3/C4.5)
- All attributes are assumed to be categorical
- Can be modified for continuous-valued attributes
- Gini index (IBM IntelligentMiner)
- All attributes are assumed continuous-valued
- Assume there exist several possible split values for each attribute
- May need other tools, such as clustering, to get the possible split values
- Can be modified for categorical attributes
24Entropy I.
- S is a sample of training examples
- p+ is the proportion of positive (P) examples
- p- is the proportion of negative (N) examples
- Entropy measures the impurity of S
- Entropy(S) = -p+ log2 p+ - p- log2 p-
25Entropy II.
- Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)
- Why?
- Information theory: an optimal-length code assigns -log2 p bits to messages having probability p
- So the expected number of bits to encode (+ or -) of a random member of S is:
- -p+ log2 p+ - p- log2 p-
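As a quick numeric check of this formula in Python: the 9+/5- class distribution that reappears on slide 28 gives I(9, 5) ≈ 0.940.

```python
import math

def binary_entropy(p_pos):
    """-p+ log2 p+ - p- log2 p-, with the convention 0*log2(0) = 0."""
    if p_pos in (0.0, 1.0):
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

print(binary_entropy(9 / 14))   # ~0.940 bits
```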
26Information Gain in Decision Tree Induction
- Assume that using attribute A a set S will be partitioned into sets S1, S2, …, Sv
- If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is E(A) = Σ_{i=1..v} (pi + ni)/(p + n) · I(pi, ni)
- The encoding information that would be gained by branching on A: Gain(A) = I(p, n) - E(A)
27Example of Information Gain
Entropy(29+, 35-) = -29/64 · log2(29/64) - 35/64 · log2(35/64) = 0.99
Split by A2 (51 and 13 examples):
- Entropy(18+, 33-) = 0.94
- Entropy(11+, 2-) = 0.62
- Gain(S, A2) = Entropy(S) - 51/64 · Entropy(18+, 33-) - 13/64 · Entropy(11+, 2-) = 0.12
Split by A1 (26 and 38 examples):
- Entropy(21+, 5-) = 0.71
- Entropy(8+, 30-) = 0.74
- Gain(S, A1) = Entropy(S) - 26/64 · Entropy(21+, 5-) - 38/64 · Entropy(8+, 30-) = 0.27
(These values are re-derived in the snippet below.)
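A small Python check that reproduces the numbers above from the class counts alone; H is a throwaway helper, not a name used in the slides.

```python
import math

def H(p, n):
    """Entropy of a node with p positive and n negative examples."""
    return sum(-f * math.log2(f) for f in (p / (p + n), n / (p + n)) if f > 0)

S = H(29, 35)                                           # ~0.99
gain_A1 = S - 26/64 * H(21, 5) - 38/64 * H(8, 30)       # ~0.27
gain_A2 = S - 51/64 * H(18, 33) - 13/64 * H(11, 2)      # ~0.12
print(round(gain_A1, 2), round(gain_A2, 2))
```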
28Attribute Selection by Information Gain
Computation
- Class P: buys_computer = yes
- Class N: buys_computer = no
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:
29The result
30Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σ_{j=1..n} pj^2, where pj is the relative frequency of class j in T
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as gini_split(T) = N1/N · gini(T1) + N2/N · gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute); see the sketch below
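A minimal sketch of both definitions in Python; gini and gini_split are illustrative names, and the split is assumed binary, as on the slide.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j pj^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted gini index of a split of T into T1 and T2."""
    n1, n2 = len(left), len(right)
    return n1 / (n1 + n2) * gini(left) + n2 / (n1 + n2) * gini(right)

print(gini(["yes"] * 9 + ["no"] * 5))   # ~0.459 for a 9/5 class split
```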
31Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- IF age < 30 AND student = no THEN buys_computer = no
- IF age < 30 AND student = yes THEN buys_computer = yes
- IF age 31..40 THEN buys_computer = yes
- IF age > 40 AND credit_rating = excellent THEN buys_computer = yes
- IF age > 40 AND credit_rating = fair THEN buys_computer = no
32Avoid Overfitting in Classification
- The generated tree may overfit the training data
- Too many branches, some may reflect anomalies due to noise or outliers
- Result is poor accuracy for unseen samples
- Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
- Difficult to choose an appropriate threshold
- Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees
- Use a set of data different from the training data to decide which is the best pruned tree
33Approaches to Determine the Final Tree Size
- Use all the data for training,
- but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle:
- halt growth of the tree when the encoding is minimized
- Use cross validation
34Cross-Validation
- Estimate the accuracy of a hypothesis induced by a supervised learning algorithm
- Predict the accuracy of a hypothesis over future unseen instances
- Select the optimal hypothesis from a given set of alternative hypotheses
- Pruning decision trees
- Model selection
- Feature selection
- Combining multiple classifiers (boosting)
35Cross-Validation
- k-fold cross-validation splits the data set D into k mutually exclusive subsets D1, D2, …, Dk
- Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di
[Figure: 4-fold cross-validation; the folds D1..D4 take turns as the held-out test set while the remaining folds are used for training.]
acc_cv = (1/n) · Σ_{(vi, yi) ∈ D} δ(I(D \ Di, vi), yi)
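A sketch of the k-fold estimate above, assuming a scikit-learn-style model object with fit and predict methods (an assumption; the slides do not name a library).

```python
import numpy as np

def k_fold_accuracy(model, X, y, k=4, seed=0):
    """acc_cv = (1/n) * sum of delta(I(D \\ Di, vi), yi) over every held-out example."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    correct = 0
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])                    # induce I on D \ Di
        correct += np.sum(model.predict(X[folds[i]]) == y[folds[i]])
    return correct / len(y)
```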
36Enhancements to decision tree induction
- Allow for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
- Handle missing attribute values
- Assign the most common value of the attribute
- Assign a probability to each of the possible values
- Attribute construction
- Create new attributes based on existing ones that are sparsely represented
- This reduces fragmentation, repetition, and replication
37Let's try
38Instance-Based Methods
- Instance-based learning:
- Store training examples and delay the processing (lazy evaluation) until a new instance must be classified
- Typical approaches:
- k-nearest neighbor approach
- Instances represented as points in a Euclidean space
- Locally weighted regression
- Constructs a local approximation
39Nearest Neighbor
- Given a distance metric
- Assign class to be the same as its nearest neighbor
- All training data is used during operation
- Multi-class decision framework
40Example
41The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance
- The target function could be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: Voronoi diagram of the decision surface induced by 1-NN, with query point xq among the + and - training points.]
42Discussion on the k-NN Algorithm
- The k-NN algorithm for continuous-valued target functions:
- Calculate the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm:
- Weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors (sketched below)
- Robust to noisy data by averaging over the k nearest neighbors
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
- To overcome it, stretch the axes or eliminate the least relevant attributes
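A minimal NumPy sketch of the distance-weighted vote described above (weighting by 1/d^2 is one common choice; the slides do not fix a particular weighting).

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x_query, k=3, weighted=True, eps=1e-9):
    """Vote among the k nearest neighbors in Euclidean distance, optionally weighted by 1/d^2."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(d)[:k]:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + eps) if weighted else 1.0
    return max(votes, key=votes.get)
```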
43k-NN Algorithm Example I.
44k-NN Algorithm Example II.
45k-NN Algorithm Example III.
46Attributes with Cost
- Consider:
- Medical diagnosis: a blood test costs 1000 USD
- How to learn a consistent tree with low expected cost?
- Replace Gain by:
- Gain^2(S, A) / Cost(A)   [Tan, Schlimmer 1990]
- (2^Gain(S, A) - 1) / (Cost(A) + 1)^w,  w ∈ [0, 1]   [Nunez 1988]
47Linear SVM - Separable Case
Consider the problem of separating the set of training vectors belonging to two separate classes, D = {(x1, y1), …, (xl, yl)}, xi ∈ R^d, yi ∈ {-1, 1}, with the hyperplane w·x + b = 0. The set of vectors is said to be optimally separated by the hyperplane if it is separated without error and the distance between the closest vector and the hyperplane is maximal.
48Linear SVM
- Let d+ (d-) be the shortest distance from the hyperplane to the closest positive (negative) example
- The margin of the hyperplane is defined to be d+ + d-
49- separating hyperplane: w·x + b = 0
- decision function: f(x) = sgn(w·x + b)
50Hence the hyperplane that optimally separates the data is the one that minimizes Φ(w) = (1/2)||w||^2, subject to yi(w·xi + b) ≥ 1, i = 1, …, l.
51- dual problem:
- maximize W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xi · xj)
- subject to αi ≥ 0 and Σi αi yi = 0
According to the Kuhn-Tucker conditions, only the points which satisfy yi(w·xi + b) = 1 will have non-zero Lagrange multipliers. These points are termed Support Vectors (SV).
52[Figure: optimal separating hyperplane w·x + b = 0; the support vectors lie closest to the hyperplane, on the margin.]
53Linear SVM - Non-Separable Case
54Linear SVM - Non-Separable Case
l observations consisting of pairs xi ∈ R^d, i = 1, …, l, and the associated labels yi ∈ {-1, 1}. Introduce positive slack variables ξi and modify the objective function to: minimize (1/2)||w||^2 + C Σi ξi, subject to yi(w·xi + b) ≥ 1 - ξi, ξi ≥ 0. C → ∞ corresponds to the separable case. (A short sketch of fitting such a soft-margin SVM follows below.)
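A minimal sketch of fitting such a soft-margin linear SVM, using scikit-learn as an assumed library (the slides do not name one); the parameter C penalizes the slack variables ξi, and a large C approaches the separable case.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=10.0)        # large C -> close to the separable case
clf.fit(X, y)
print(clf.coef_, clf.intercept_)          # w and b of the hyperplane w.x + b = 0
print(len(clf.support_vectors_))          # points with non-zero Lagrange multipliers
```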
57Non-Linear SVM
60Summary
- What is classification, and how does it differ from clustering?
- What is the essence of the k-NN algorithm?
- Decision trees, information entropy, gain
- Bayesian classification
- SVM
61What is the best model? Accuracy vs. generalization
- Find a model that avoids overfitting: too high accuracy on the training set may result in poor generalization (classification accuracy on new instances of the data).
62How to choose feature space?
[Figure: two candidate feature spaces for separating adults from kids: weight vs. height, and estrogen vs. testosterone.]