1
Data Warehouses and Their Exploitation: Classification Algorithms
  • dr. Abonyi János
  • Veszprémi Egyetem
  • www.fmt.vein.hu/softcomp

2
Training data
  • A collection of records (objects) x. Each record
    contains a set of features and the class C that
    it belongs to.
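
For concreteness, a minimal sketch of such a training set in code (the attribute names and values are invented for illustration):

```python
# A toy training set: each record (object) x is a set of features plus the
# class C it belongs to. Attribute names and values are illustrative only.
training_data = [
    ({"income": 45000, "debt": 12000}, "no loan"),
    ({"income": 82000, "debt": 3000},  "loan"),
    ({"income": 30000, "debt": 9000},  "no loan"),
]

for features, label in training_data:
    print(features, "->", label)
```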

3
Predictive Modelling (Classification)
Linear Classifier vs. Non-Linear Classifier
[Figure: loan applicants plotted in the income-debt plane, separated once by a linear decision boundary and once by a non-linear one.]
Linear rule: a·income + b·debt < t  ->  No loan!
4
Example
5
Predictive Modeling
Goal: learn a mapping y = f(x; θ)
Need: 1. a model structure, 2. a score function, 3. an optimization strategy
Categorical y ∈ {c1, ..., cm}: classification
Real-valued y: regression
Note: we usually assume c1, ..., cm are mutually exclusive and exhaustive
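
A minimal sketch of the three ingredients for one particular choice of model (a linear mapping), score function (squared error) and optimization strategy (gradient descent); these choices are illustrative, not prescribed by the slide:

```python
import numpy as np

# 1. Model structure: a linear mapping f(x; theta) = theta . x
def f(X, theta):
    return X @ theta

# 2. Score function: mean squared error between predictions and targets
def score(theta, X, y):
    return np.mean((f(X, theta) - y) ** 2)

# 3. Optimization strategy: plain gradient descent on the score
def fit(X, y, lr=0.01, steps=500):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (f(X, theta) - y) / len(y)
        theta -= lr * grad
    return theta

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = fit(X, y)
print(theta, score(theta, X, y))
```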
6
Classifier Types
Discrimination: direct mapping from x to {c1, ..., cm} - e.g. perceptron, SVM, CART
Regression: model p(ck | x) - e.g. logistic regression, CART
Class-conditional: model p(x | ck, θk) - e.g. Bayesian classifiers, LDA
7
Evaluation of Classification Systems
Training Set: examples with class values, used for learning.
Test Set: examples with class values, used for evaluation.
Evaluation: hypotheses are used to infer the classification of examples in the test set; the inferred classification is compared to the known classification.
Accuracy: the percentage of examples in the test set that are classified correctly.
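
Accuracy as defined here is just the fraction of test-set predictions that match the known classes; a minimal sketch:

```python
def accuracy(predicted, actual):
    """Fraction of test-set examples classified correctly."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

print(accuracy(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))  # 0.75
```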
8
Algorithms for supervised learning
  • Neural networks (general non-linear models,
    adaptivity, artificial brain) (previous lesson)
  • Bayes (Linear/Fisher Discriminant Analysis)
  • Decision trees (logical rules)
  • k-NN (k-Nearest Neighbors) (simple
    non-parametric)
  • SVM (Support Vector Machines)

9
Bayesian Classification: Why?
  • Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
  • Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
  • Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
  • Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

10
Bayesian classification
  • The classification problem may be formalized
    using a-posteriori probabilities
  • P(C | X) = probability that the sample tuple X = <x1, ..., xk> is of class C
  • E.g. P(class = N | outlook = sunny, windy = true, ...)
  • Idea: assign to sample X the class label C such that P(C | X) is maximal

11
Estimating a-posteriori probabilities
  • Bayes theorem:
  • P(C | X) = P(X | C) · P(C) / P(X)
  • P(X) is constant for all classes
  • P(C) = relative frequency of class C samples
  • The C for which P(C | X) is maximum is the C for which P(X | C) · P(C) is maximum
  • Problem: computing P(X | C)

12
Naïve Bayesian Classification
  • Naïve assumption: attribute independence
  • P(x1, ..., xk | C) = P(x1 | C) · ... · P(xk | C)
  • If the i-th attribute is categorical: P(xi | C) is estimated as the relative frequency of samples having value xi as their i-th attribute in class C
  • If the i-th attribute is continuous: P(xi | C) is estimated through a Gaussian density function
  • Computationally easy in both cases

13
Play-tennis example estimating P(xiC)
14
Play-tennis example classifying X
  • An unseen sample X = <rain, hot, high, false>
  • P(X | p) · P(p) = P(rain | p) · P(hot | p) · P(high | p) · P(false | p) · P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
  • P(X | n) · P(n) = P(rain | n) · P(hot | n) · P(high | n) · P(false | n) · P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
  • Sample X is classified in class n (don't play)
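
The same computation in code (a sketch that simply multiplies the conditional relative frequencies quoted above from the play-tennis table):

```python
# Conditional probabilities taken from the play-tennis example above.
p_play = {"rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9}
p_dont = {"rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5}
prior_play, prior_dont = 9/14, 5/14

x = ["rain", "hot", "high", "false"]   # the unseen sample X

score_play, score_dont = prior_play, prior_dont
for value in x:
    score_play *= p_play[value]
    score_dont *= p_dont[value]

print(round(score_play, 6))  # 0.010582
print(round(score_dont, 6))  # 0.018286
print("class:", "p (play)" if score_play > score_dont else "n (don't play)")
```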

15
The independence hypothesis
  • makes computation possible
  • yields optimal classifiers when satisfied
  • but is seldom satisfied in practice, as
    attributes (variables) are often correlated.
  • Attempts to overcome this limitation
  • Bayesian networks, that combine Bayesian
    reasoning with causal relationships between
    attributes

16
Let's try
17
Linear Discriminant Analysis
Could model each class density as multivariate
normal
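
For illustration, a sketch that fits such a model with scikit-learn's LinearDiscriminantAnalysis on made-up two-class data (assuming numpy and scikit-learn are available):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two made-up, roughly Gaussian classes in 2-D.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()   # models each class density as multivariate
lda.fit(X, y)                        # normal with a shared covariance matrix
print(lda.predict([[1.0, 1.0], [2.5, 3.5]]))
```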
18
Example
19
A Decision Tree for buys_computer
[Decision tree diagram: the root tests age? with branches <30, 30..40, and >40; the <30 branch tests student? (leaves no / yes), the 30..40 branch is a yes leaf, and the >40 branch tests credit_rating? with fair and excellent branches leading to the class leaves (see the rules extracted from this tree on a later slide).]
20
Example
21
Classification by Decision Tree Induction
  • Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class
    distribution
  • Decision tree generation consists of two phases
  • Tree construction
  • At start, all the training examples are at the
    root
  • Partition examples recursively based on selected
    attributes
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Use of decision tree Classifying an unknown
    sample
  • Test the attribute values of the sample against
    the decision tree

22
Algorithm for Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further partitioning (majority voting is then employed for classifying the leaf)
  • There are no samples left
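
A compact sketch of this greedy, top-down induction for categorical attributes, using information gain as the selection heuristic (an illustrative ID3-style implementation, not the exact algorithm of the slides; the toy records are invented):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Stopping conditions: all samples in one class, or no attributes left
    # (then the leaf is classified by majority voting).
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return tree

# Invented toy records (categorical attributes only, as the slide assumes).
rows = [{"outlook": "sunny", "windy": "true"},
        {"outlook": "sunny", "windy": "false"},
        {"outlook": "rain",  "windy": "true"}]
labels = ["n", "p", "n"]
print(build_tree(rows, labels, ["outlook", "windy"]))
```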

23
Attribute Selection Measure
  • Information gain (ID3/C4.5)
  • All attributes are assumed to be categorical
  • Can be modified for continuous-valued attributes
  • Gini index (IBM IntelligentMiner)
  • All attributes are assumed continuous-valued
  • Assume there exist several possible split values
    for each attribute
  • May need other tools, such as clustering, to get
    the possible split values
  • Can be modified for categorical attributes

24
Entropy I.
  • S is a sample of training examples
  • p+ is the proportion of positive examples in S
  • p- is the proportion of negative examples in S
  • Entropy measures the impurity of S:
  • Entropy(S) = -p+ log2 p+ - p- log2 p-

25
Entropy II.
  • Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)
  • Why?
  • Information theory: an optimal-length code assigns -log2 p bits to a message having probability p.
  • So the expected number of bits to encode (+ or -) of a random member of S is
  • -p+ log2 p+ - p- log2 p-

26
Information Gain in Decision Tree Induction
  • Assume that using attribute A a set S will be partitioned into sets S1, S2, ..., Sv
  • If Si contains pi examples of P and ni examples of N, the entropy (the expected information needed to classify objects in all subtrees Si) is E(A) = Σ_(i=1..v) (pi + ni)/(p + n) · I(pi, ni)
  • The encoding information that would be gained by branching on A is Gain(A) = I(p, n) - E(A)

27
Example of Information Gain
Entropy([29+,35-]) = -29/64 log2(29/64) - 35/64 log2(35/64) = 0.99

Split on attribute A2:
  • Entropy([18+,33-]) = 0.94
  • Entropy([11+,2-]) = 0.62
  • Gain(S,A2) = Entropy(S) - 51/64 · Entropy([18+,33-]) - 13/64 · Entropy([11+,2-]) = 0.12

Split on attribute A1:
  • Entropy([21+,5-]) = 0.71
  • Entropy([8+,30-]) = 0.74
  • Gain(S,A1) = Entropy(S) - 26/64 · Entropy([21+,5-]) - 38/64 · Entropy([8+,30-]) = 0.27
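
The numbers above can be reproduced with a short script (a sketch; A1 and A2 are the two candidate attributes splitting the [29+,35-] collection S):

```python
import math

def entropy(pos, neg):
    """Entropy of a node from its positive/negative counts."""
    total = pos + neg
    result = 0.0
    for c in (pos, neg):
        if c:
            result -= (c / total) * math.log2(c / total)
    return result

def gain(S, subsets):
    """Information gain of a split of S into the given (pos, neg) subsets."""
    n = sum(S)
    return entropy(*S) - sum((p + q) / n * entropy(p, q) for p, q in subsets)

S = (29, 35)                                     # [29+, 35-]
print(round(entropy(*S), 2))                     # 0.99
print(round(gain(S, [(21, 5), (8, 30)]), 2))     # Gain(S, A1) = 0.27
print(round(gain(S, [(18, 33), (11, 2)]), 2))    # Gain(S, A2) = 0.12
```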

28
Attribute Selection by Information Gain
Computation
  • Class P: buys_computer = "yes"; Class N: buys_computer = "no"
  • I(p, n) = I(9, 5) = 0.940
  • Compute the entropy for age: E(age) = Σ (pi + ni)/(p + n) · I(pi, ni) = 0.694
  • Hence Gain(age) = I(p, n) - E(age) = 0.246
  • Similarly, compute the gain of the remaining attributes and select the one with the highest gain

29
The result
30
Gini Index (IBM IntelligentMiner)
  • If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σ_(j=1..n) pj², where pj is the relative frequency of class j in T.
  • If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2).
  • The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible splitting points need to be enumerated for each attribute).
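
A small sketch of these two formulas, working directly on class counts (the example counts are made up):

```python
def gini(counts):
    """gini(T) = 1 - sum_j pj^2, computed from the class counts in T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted gini index of a binary split T -> (T1, T2)."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

print(gini([9, 5]))                 # impurity of the unsplit node
print(gini_split([6, 1], [3, 4]))   # impurity after a candidate split
```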

31
Extracting Classification Rules from Trees
  • Represent the knowledge in the form of IF-THEN
    rules
  • One rule is created for each path from the root
    to a leaf
  • Each attribute-value pair along a path forms a
    conjunction
  • The leaf node holds the class prediction
  • Rules are easier for humans to understand
  • IF age < 30 AND student = no THEN buys_computer = no
  • IF age < 30 AND student = yes THEN buys_computer = yes
  • IF age 31..40 THEN buys_computer = yes
  • IF age > 40 AND credit_rating = excellent THEN buys_computer = yes
  • IF age > 40 AND credit_rating = fair THEN buys_computer = no

32
Avoid Overfitting in Classification
  • The generated tree may overfit the training data
  • Too many branches, some may reflect anomalies due
    to noise or outliers
  • The result is poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees
  • Use a set of data different from the training data to decide which is the best pruned tree

33
Approaches to Determine the Final Tree Size
  • Use all the data for training
  • but apply a statistical test (e.g., chi-square)
    to estimate whether expanding or pruning a node
    may improve the entire distribution
  • Use minimum description length (MDL) principle
  • halting growth of the tree when the encoding is
    minimized
  • Use cross validation

34
Cross-Validation
  • Estimate the accuracy of a hypothesis induced by
    a supervised learning algorithm
  • Predict the accuracy of a hypothesis over future
    unseen instances
  • Select the optimal hypothesis from a given set of
    alternative hypotheses
  • Pruning decision trees
  • Model selection
  • Feature selection
  • Combining multiple classifiers (boosting)

35
Cross-Validation
  • k-fold cross-validation splits the data set D
    into k mutually exclusive subsets D1,D2,,Dk
  • Train and test the learning algorithm k times,
    each time it is trained on D\Di and tested on Di

[Diagram: the data set D shown as four folds D1, D2, D3, D4; in each of the k runs a different fold Di is held out for testing while the remaining folds form the training set.]
acc_cv = 1/n · Σ_(vi,yi)∈D δ( I(D \ Di, vi), yi )
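
A minimal k-fold cross-validation sketch mirroring the formula above: train the inducer I on D \ Di, test on Di, and average the 0/1 agreement over all held-out examples (the majority-class inducer below is a placeholder, purely for illustration):

```python
from collections import Counter

def k_fold_cv(data, induce, k=4):
    """data: list of (x, y) pairs; induce: function mapping a training list
    to a classifier h, where h(x) returns the predicted label."""
    folds = [data[i::k] for i in range(k)]   # k mutually exclusive subsets
    correct, n = 0, len(data)
    for i in range(k):
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        h = induce(train)                    # I(D \ Di)
        correct += sum(1 for x, y in folds[i] if h(x) == y)
    return correct / n                       # acc_cv

# Placeholder inducer: always predicts the majority class of its training set.
def majority_inducer(train):
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

data = [([0], "a"), ([1], "a"), ([2], "b"), ([3], "a"),
        ([4], "b"), ([5], "a"), ([6], "a"), ([7], "b")]
print(k_fold_cv(data, majority_inducer, k=4))
```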
36
Enhancements to decision tree induction
  • Allow for continuous-valued attributes
  • Dynamically define new discrete-valued attributes
    that partition the continuous attribute value
    into a discrete set of intervals
  • Handle missing attribute values
  • Assign the most common value of the attribute
  • Assign probability to each of the possible values
  • Attribute construction
  • Create new attributes based on existing ones that
    are sparsely represented
  • This reduces fragmentation, repetition, and
    replication

37
Let's try
38
Instance-Based Methods
  • Instance-based learning
  • Store training examples and delay the processing
    (lazy evaluation) until a new instance must be
    classified
  • Typical approaches
  • k-nearest neighbor approach
  • Instances represented as points in a Euclidean
    space.
  • Locally weighted regression
  • Constructs local approximation

39
Nearest Neighbor
  • Given a distance metric
  • Assign class to be the same as its nearest
    neighbor
  • All training data is used during operation
  • Multi-class decision framework

40
Example
41
The k-Nearest Neighbor Algorithm
  • All instances correspond to points in the n-D
    space.
  • The nearest neighbors are defined in terms of Euclidean distance.
  • The target function could be discrete- or real-
    valued.
  • For discrete-valued, the k-NN returns the most
    common value among the k training examples
    nearest to xq.
  • Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

[Figure: Voronoi diagram of the 1-NN decision surface around the query point xq, with positive and negative training examples.]
42
Discussion on the k-NN Algorithm
  • The k-NN algorithm for continuous-valued target
    functions
  • Calculate the mean values of the k nearest
    neighbors
  • Distance-weighted nearest neighbor algorithm
  • Weight the contribution of each of the k
    neighbors according to their distance to the
    query point xq
  • giving greater weight to closer neighbors
  • Robust to noisy data by averaging k-nearest
    neighbors
  • Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes.
  • To overcome it, stretch the axes or eliminate the least relevant attributes.
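
A sketch of the plain and distance-weighted k-NN vote described above, using Euclidean distance (the tiny training set is made up):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, xq, k=3, weighted=False):
    """train: list of (x, label); returns the (optionally distance-weighted)
    majority label among the k nearest neighbors of the query point xq."""
    neighbors = sorted(train, key=lambda p: euclidean(p[0], xq))[:k]
    votes = Counter()
    for x, label in neighbors:
        d = euclidean(x, xq)
        votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((4.0, 4.2), "-"), ((4.5, 3.8), "-")]
print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (1.1, 1.0), k=3, weighted=True))
```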

43
k-NN Algorithm Example I.
44
k-NN Algorithm Example II.
45
k-NN Algorithm Example III.
46
Attributes with Cost
  • Consider:
  • Medical diagnosis: a blood test costs 1000 USD
  • How to learn a consistent tree with low expected cost?
  • Replace Gain by
  • Gain²(S,A) / Cost(A)   [Tan and Schlimmer, 1990]
  • (2^Gain(S,A) - 1) / (Cost(A) + 1)^w, with w ∈ [0,1]   [Nunez, 1988]

47
Linear SVM - Separable Case
Consider the problem of separating the set of training vectors belonging to two separate classes, D = {(x1, y1), ..., (xl, yl)}, xi ∈ R^d, yi ∈ {-1, 1}, with a hyperplane w·x + b = 0. The set of vectors is said to be optimally separated by the hyperplane if it is separated without error and the distance between the closest vector and the hyperplane is maximal.
48
Linear SVM
  • Let d+ (d-) be the shortest distance from the hyperplane to the closest positive (negative) example.
  • The margin of the hyperplane is defined to be d+ + d-.

49
  • separating hyperplane: w·x + b = 0
  • decision function: f(x) = sgn(w·x + b)

50
Hence the hyperplane that optimally separates the data is the one that minimizes (1/2)·||w||², subject to yi (w·xi + b) ≥ 1 for all i.
51
  • Dual problem:
  • maximize W(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xi·xj)
  • subject to αi ≥ 0 and Σi αi yi = 0

According to the Kuhn-Tucker conditions, only the points which satisfy yi (w·xi + b) = 1 will have non-zero Lagrange multipliers αi. These points are termed Support Vectors (SV).
52
[Figure: the separating hyperplane w·x + b = 0 with its margin; the training points lying on the margin are the support vectors.]
53
Linear SVM - Non-Separable Case
54
Linear SVM - Non-Separable Case
We have l observations, each consisting of a pair xi ∈ R^d, i = 1, ..., l, and the associated label yi ∈ {-1, 1}. Introduce positive slack variables ξi and modify the objective function to (1/2)·||w||² + C Σi ξi; letting C grow large corresponds to the separable case.
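
As an illustration (a sketch assuming scikit-learn; the data are made up): the soft-margin problem above corresponds to the C parameter of a linear SVC, the points with non-zero Lagrange multipliers are exposed as the support vectors, and swapping the kernel (e.g. kernel='rbf') gives the non-linear SVM of the following slides:

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up, roughly separable classes in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.8, size=(30, 2)),
               rng.normal([3, 3], 0.8, size=(30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=10.0)   # a large C approaches the separable case
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # w and b of the hyperplane w.x + b = 0
print(clf.support_vectors_)          # the support vectors
print(clf.predict([[1.0, 1.0], [2.5, 3.0]]))
```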
55
(No Transcript)
56
(No Transcript)
57
Non-Linear SVM
58
(No Transcript)
59
(No Transcript)
60
Summary
  • What is classification, and how does it differ from clustering?
  • What is the essence of the k-NN algorithm?
  • Decision trees, information entropy, information gain
  • Bayesian classification
  • SVM

61
What is the best model? Accuracy vs. generalization
  • Find a model that avoids overfitting: too high an accuracy on the training set may result in poor generalization (classification accuracy on new instances of the data).

62
How to choose feature space?
[Figure: adults vs. kids plotted in two candidate feature spaces, weight vs. height and estrogen vs. testosterone.]
63
(No Transcript)