CS 9633 Machine Learning - PowerPoint PPT Presentation

Slides: 61
Provided by: brid157
1
CS 9633 Machine Learning
  • Computational Learning Theory

Adapted from notes by Tom Mitchell
http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
2
Theoretical Characterization of Learning Problems
  • Under what conditions is successful learning
    possible and impossible?
  • Under what conditions is a particular learning
    algorithm assured of learning successfully?

3
Two Frameworks
  • PAC (Probably Approximately Correct) Learning
    Framework: identify classes of hypotheses that
    can and cannot be learned from a polynomial
    number of training examples
  • Define a natural measure of complexity for
    hypothesis spaces that allows bounding the number
    of training examples needed
  • Mistake Bound Framework

4
Theoretical Questions of Interest
  • Is it possible to identify classes of learning
    problems that are inherently difficult or easy,
    independent of the learning algorithm?
  • Can one characterize the number of training
    examples necessary or sufficient to assure
    successful learning?
  • How is the number of examples affected
  • if the learner observes a random sample of
    training data?
  • if the learner is allowed to pose queries to the
    trainer?
  • Can one characterize the number of mistakes that
    a learner will make before learning the target
    function?
  • Can one characterize the inherent computational
    complexity of a class of learning algorithms?

5
Computational Learning Theory
  • Relatively recent field
  • Area of intense research
  • For some of the questions on the previous slide,
    the (partial) answer is yes.
  • Will generally focus on certain types of learning
    problems.

6
Inductive Learning of Target Function
  • What we are given
  • Hypothesis space
  • Training examples
  • What we want to know
  • How many training examples are sufficient to
    successfully learn the target function?
  • How many mistakes will the learner make before
    succeeding?

7
Questions for Broad Classes of Learning Algorithms
  • Sample complexity
  • How many training examples do we need to
    converge to a successful hypothesis with a high
    probability?
  • Computational complexity
  • How much computational effort is needed to
    converge to a successful hypothesis with a high
    probability?
  • Mistake Bound
  • How many training examples will the learner
    misclassify before converging to a successful
    hypothesis?

8
PAC Learning
  • Probably Approximately Correct Learning Model
  • Will restrict discussion to learning
    boolean-valued concepts in noise-free data.

9
Problem Setting: Instances and Concepts
  • X is set of all possible instances over which
    target function may be defined
  • C is set of target concepts learner is to learn
  • Each target concept c in C is a subset of X
  • Each target concept c in C is a boolean function
  • c : X → {0, 1}
  • c(x) = 1 if x is a positive example of the concept
  • c(x) = 0 otherwise

10
Problem Setting: Distribution
  • Instances generated at random using some
    probability distribution D
  • D may be any distribution
  • D is generally not known to the learner
  • D is required to be stationary (does not change
    over time)
  • Training examples x are drawn at random from X
    according to D and presented with target value
    c(x) to the learner.

11
Problem Setting: Hypotheses
  • Learner L considers set of hypotheses H
  • After observing a sequence of training examples
    of the target concept c, L must output some
    hypothesis h from H which is its estimate of c

12
Example Problem (Classifying Executables)
  • Three Classes (Malicious, Boring, Funny)
  • Features
  • a1: GUI present (yes/no)
  • a2: Deletes files (yes/no)
  • a3: Allocates memory (yes/no)
  • a4: Creates new thread (yes/no)
  • Distribution?
  • Hypotheses?

13
Instance  a1   a2   a3   a4   Class
1         Yes  No   No   Yes  B
2         Yes  No   No   No   B
3         No   Yes  Yes  No   F
4         No   No   Yes  Yes  M
5         Yes  No   No   Yes  B
6         Yes  No   No   No   F
7         Yes  Yes  Yes  No   M
8         Yes  Yes  No   Yes  M
9         No   No   No   Yes  B
10        No   No   Yes  No   M
14
True Error
  • Definition: The true error (denoted error_D(h))
    of hypothesis h with respect to target concept c
    and distribution D is the probability that h
    will misclassify an instance drawn at random
    according to D.
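
As a minimal sketch of this definition, the true error can be computed exactly when the instance space is small and D is known. The target concept, hypothesis, and uniform distribution below are assumptions chosen for illustration, not part of the original slides.

```python
from itertools import product

def true_error(h, c, instances, prob):
    """Sum the probability mass of the instances on which h and c disagree."""
    return sum(prob(x) for x in instances if h(x) != c(x))

# Hypothetical setup: four boolean features (a1..a4) and a uniform distribution D.
instances = list(product([False, True], repeat=4))
uniform = lambda x: 1.0 / len(instances)

c = lambda x: x[0] and x[2]   # assumed target concept: a1 AND a3
h = lambda x: x[0]            # assumed hypothesis: a1 only

# h and c disagree exactly when a1 is true and a3 is false.
err = true_error(h, c, instances, uniform)
print(err)
```

In practice D is unknown to the learner, so this quantity cannot be computed this way; the sketch only makes the definition concrete.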

15
Error of h with respect to c
[Figure: instance space X showing the regions covered by target concept c and hypothesis h, with negative (-) instances marked]
16
Key Points
  • True error defined over entire instance space,
    not just training data
  • Error depends strongly on the unknown probability
    distribution D
  • The error of h with respect to c is not directly
    observable to the learner; L can only observe
    performance with respect to training data
    (training error)
  • Question: How probable is it that the observed
    training error for h gives a misleading estimate
    of the true error?

17
PAC Learnability
  • Goal: characterize classes of target concepts
    that can be reliably learned
  • from a reasonable number of randomly drawn
    training examples and
  • using a reasonable amount of computation
  • Unreasonable to expect perfect learning where
    error_D(h) = 0
  • Would need to provide training examples
    corresponding to every possible instance
  • With random sample of training examples, there is
    always a non-zero probability that the training
    examples will be misleading

18
Weaken Demand on Learner
  • Hypothesis error (Approximately)
  • Will not require a zero error hypothesis
  • Require that error is bounded by some constant ε
    that can be made arbitrarily small
  • ε is the error parameter
  • Error on training data (Probably)
  • Will not require that the learner succeed on
    every sequence of randomly drawn training
    examples
  • Require that its probability of failure is
    bounded by a constant δ that can be made
    arbitrarily small
  • δ is the confidence parameter

19
Definition of PAC-Learnability
  • Definition: Consider a concept class C defined
    over a set of instances X of length n and a
    learner L using hypothesis space H. C is
    PAC-learnable by L using H if for all c ∈ C,
    distributions D over X, ε such that 0 < ε < 1/2,
    and δ such that 0 < δ < 1/2, learner L will with
    probability at least (1 - δ) output a hypothesis
    h ∈ H such that error_D(h) ≤ ε, in time that is
    polynomial in 1/ε, 1/δ, n, and size(c).

20
Requirements of Definition
  • L must, with arbitrarily high probability (1 - δ),
    output a hypothesis having arbitrarily low error
    (ε).
  • L's learning must be efficient: it grows
    polynomially in terms of
  • Strengths of output hypothesis (1/ε, 1/δ)
  • Inherent complexity of instance space (n) and
    concept class C (size(c)).

21
Block Diagram of PAC Learning Model
[Figure: learning algorithm L takes the control parameters ε, δ and a training sample as input and outputs hypothesis h]
22
Examples of second requirement
  • Consider executables problem where instances are
    conjunctions of boolean features
  • a1=yes ∧ a2=no ∧ a3=yes ∧ a4=no
  • Concepts are conjunctions of a subset of the
    features
  • a1=yes ∧ a3=yes ∧ a4=yes
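
A small sketch of how the two quantities n and size(c) might be read off this example. Treating size(c) as the number of literals in the concept is an assumption made here for illustration; the dictionary encoding is likewise hypothetical.

```python
# An instance assigns a value to every feature; a concept constrains a subset of them.
instance = {"a1": "yes", "a2": "no", "a3": "yes", "a4": "no"}
concept = {"a1": "yes", "a3": "yes", "a4": "yes"}

def classify(concept, instance):
    """Return True when the instance satisfies every literal in the concept."""
    return all(instance[feature] == value for feature, value in concept.items())

n = len(instance)       # number of boolean features describing an instance
size_c = len(concept)   # assumed encoding length of c: its number of literals
label = classify(concept, instance)   # a4 is "no", so the concept rejects this instance
```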

23
Using the Concept of PAC Learning in Practice
  • We often want to know how many training instances
    we need in order to achieve a certain level of
    accuracy with a specified probability.
  • If L requires some minimum processing time per
    training example, then for C to be PAC-learnable
    by L, L must learn from a polynomial number of
    training examples.

24
Sample Complexity
  • Sample complexity of a learning problem is the
    growth in the required training examples with
    problem size.
  • Will determine the sample complexity for
    consistent learners.
  • A learner is consistent if it outputs hypotheses
    which perfectly fit the training data whenever
    possible.
  • All algorithms in Chapter 2 are consistent
    learners.

25
Recall definition of VS
  • The version space, denoted VS_{H,D}, with respect to
    hypothesis space H and training examples D, is
    the subset of hypotheses from H consistent with
    the training examples in D

26
VS and PAC learning by consistent learners
  • Every consistent learner outputs a hypothesis
    belonging to the version space, regardless of the
    instance space X, hypothesis space H, or training
    data D.
  • To bound the number of examples needed by any
    consistent learner, we need only to bound the
    number of examples needed to assure that the
    version space contains no unacceptable hypotheses.

27
ε-exhausted
  • Definition: Consider a hypothesis space H,
    target concept c, instance distribution D, and
    set of training examples D of c. The version
    space VS_{H,D} is said to be ε-exhausted with
    respect to c and D, if every hypothesis h in VS_{H,D}
    has error less than ε with respect to c and D.

28
Exhausting the version space
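[Figure: hypothesis space H with the version space VS_{H,D} inside it; each hypothesis is labeled with its true error and its training error r]
  • error = 0.2, r = 0
  • error = 0.1, r = 0.2
  • error = 0.3, r = 0.4
  • error = 0.1, r = 0
  • error = 0.2, r = 0.3
  • error = 0.3, r = 0.2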
29
Exhausting the Version Space
  • Only an observer who knows the identity of the
    target concept can determine with certainty
    whether the version space is ε-exhausted.
  • But we can bound the probability that the
    version space will be ε-exhausted after a given
    number of training examples
  • Without knowing the identity of the target
    concept
  • Without knowing the distribution from which
    training examples were drawn

30
Theorem 7.1
  • Theorem 7.1 (ε-exhausting the version space): If
    the hypothesis space H is finite, and D is a sequence
    of m ≥ 1 independent randomly drawn examples of
    some target concept c, then for any 0 ≤ ε ≤ 1, the
    probability that the version space VS_{H,D} is not
    ε-exhausted (with respect to c) is less than or
    equal to
  • $|H| e^{-\epsilon m}$

31
Proof of theorem
  • See text

32
Number of Training Examples (Eq. 7.2)
  • $m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln(1/\delta)\right)$
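
The bound on m follows from Theorem 7.1 by requiring the failure probability to be at most δ and solving for m. The steps, written out as a short derivation (δ here is the confidence parameter from the PAC definition):

```latex
\begin{align*}
|H|\, e^{-\epsilon m} &\le \delta \\
-\epsilon m &\le \ln\delta - \ln|H| \\
m &\ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
\end{align*}
```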
33
Summary of Result
  • Inequality on previous slide provides a general
    bound on the number of training examples
    sufficient for any consistent learner to
    successfully learn any target concept in H, for
    any desired values of δ and ε.
  • This number m of training examples is sufficient
    to assure that any consistent hypothesis will be
  • probably (with probability 1 - δ)
  • approximately (within error ε) correct.
  • The value of m grows
  • linearly with 1/ε
  • logarithmically with 1/δ
  • logarithmically with |H|
  • The bound can be a substantial overestimate.

34
Problem
  • Suppose we have the instance space described for
    the EnjoySport problem
  • Sky (Sunny, Cloudy, Rainy)
  • AirTemp (Warm, Cold)
  • Humidity (Normal, High)
  • Wind (Strong, Weak)
  • Water (Warm, Cold)
  • Forecast (Same, Change)
  • Hypotheses can be as before
  • (?, Warm, Normal, ?, ?, Same) or (∅, ∅, ∅, ∅, ∅, ∅)
  • How many training examples do we need to have an
    error rate of less than 10% with a probability of
    95%?
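
One way to work this problem is to plug |H|, ε = 0.1, and δ = 0.05 into Equation 7.2. The sketch below assumes two common ways of counting EnjoySport hypotheses (syntactically and semantically distinct conjunctions); the function name is hypothetical.

```python
import math

def sample_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Assumed hypothesis counts for EnjoySport conjunctions:
# syntactic: each attribute takes one of its values, "?", or the empty constraint
syntactic = 5 * 4 * 4 * 4 * 4 * 4
# semantic: one all-negative hypothesis plus a value or "?" for each attribute
semantic = 1 + 4 * 3 * 3 * 3 * 3 * 3

m_semantic = sample_bound(semantic, epsilon=0.1, delta=0.05)
m_syntactic = sample_bound(syntactic, epsilon=0.1, delta=0.05)
print(m_semantic, m_syntactic)
```

Because m depends on |H| only through its logarithm, the five-fold difference between the two counts changes the answer by fewer than twenty examples.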

35
Limits of Equation 7.2
  • Equation 7.2 tells us how many training examples
    suffice to ensure (with probability 1 - δ) that
    every hypothesis having zero training error will
    have a true error of at most ε.
  • Problem: there may be no hypothesis consistent
    with the training data if the concept is not in H.
    In this case, we want the minimum error hypothesis.

36
Agnostic Learning and Inconsistent Hypotheses
  • An Agnostic Learner does not make the assumption
    that the concept is contained in the hypothesis
    space.
  • We may want to consider the hypothesis with the
    minimum error
  • Can derive a bound similar to the previous one
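
The slide does not state the agnostic bound. A sketch, assuming the intended result is the Hoeffding-based bound given in Mitchell's text, m ≥ (1/(2ε²))(ln|H| + ln(1/δ)), where ε now bounds how far the true error can exceed the training error of the best hypothesis:

```python
import math

def agnostic_sample_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1 / (2 * epsilon**2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2))

# Assumed example: |H| = 973, epsilon = 0.1, delta = 0.05.
m = agnostic_sample_bound(973, epsilon=0.1, delta=0.05)
print(m)
```

Under this assumption m grows with 1/ε² rather than 1/ε, so dropping the assumption that c is in H costs noticeably more examples for the same ε.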

37
Concepts that are PAC-Learnable
  • Proofs that a type of concept is PAC-Learnable
    usually consist of two steps
  • Show that each target concept in C can be learned
    from a polynomial number of training examples
  • Show that the processing time per training
    example is also polynomially bounded

38
PAC Learnability of Conjunctions of Boolean
Literals
  • Class C of target concepts described by
    conjunctions of boolean literals
  • GUI_Present ∧ ¬Opens_files
  • Is C PAC-learnable? Yes.
  • Will prove by
  • Showing that a polynomial number of training
    examples suffices to learn each concept
  • Demonstrating an algorithm that uses polynomial
    time per training example

39
Examples Needed to Learn Each Concept
  • Consider a consistent learner that uses
    hypothesis space H = C
  • Compute the number m of random training examples
    sufficient to ensure that L will, with
    probability (1 - δ), output a hypothesis with
    maximum error ε.
  • We will use $m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln(1/\delta)\right)$
  • What is the size of the hypothesis space?
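
A sketch of the answer, assuming H is the set of conjunctions over n boolean variables in which each variable appears as a positive literal, a negated literal, or not at all, so that |H| = 3^n. Substituting into the bound gives m ≥ (1/ε)(n ln 3 + ln(1/δ)), which is linear in n. The function name and the example values are illustrative.

```python
import math

def conjunction_sample_bound(n, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (n * ln 3 + ln(1/delta)), using |H| = 3**n."""
    return math.ceil((n * math.log(3) + math.log(1.0 / delta)) / epsilon)

# Assumed example: 10 boolean variables, error at most 0.1, confidence 95%.
m = conjunction_sample_bound(n=10, epsilon=0.1, delta=0.05)
print(m)
```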

40
Complexity Per Example
  • We just need to show that for some algorithm, we
    can spend a polynomial amount of time per
    training example.
  • One way to do this is to give an algorithm.
  • In this case, we can use Find-S as the learning
    algorithm.
  • Find-S incrementally computes the most specific
    hypothesis consistent with each training example.
  • Old ∧ Tired
  • Old ∧ Happy
  • Tired
  • Old ∧ ¬Tired -
  • Rich ∧ Happy
  • What is a bound on the time per example?
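
A minimal sketch of Find-S for conjunctions of boolean literals, written to make the per-example cost visible. The literal encoding and the training data are assumptions for illustration; they are not the examples from the slide, whose labels did not survive.

```python
def find_s(examples, n):
    """Find-S for conjunctions of boolean literals over n variables.

    The hypothesis is a set of literals (index, value). It starts with every
    literal (the most specific hypothesis) and drops the literals that a
    positive example violates. Negative examples are ignored.
    """
    h = {(i, v) for i in range(n) for v in (True, False)}
    for x, label in examples:
        if label:
            # At most 2n literals are checked, so the work per example is O(n).
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

# Hypothetical data over the variables (Old, Tired, Happy, Rich).
examples = [
    ((True, True, False, False), True),
    ((True, False, True, False), True),
    ((False, True, False, False), False),
]
h = find_s(examples, n=4)   # the literals left are Old and NOT Rich
```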

41
Theorem 7.2
  • PAC-learnability of boolean conjunctions: The
    class C of conjunctions of boolean literals is
    PAC-learnable by the FIND-S algorithm using H = C

42
Proof of Theorem 7.2
  • Equation 7.4 shows that the sample complexity for
    this concept class is polynomial in n, 1/δ, and
    1/ε, and independent of size(c). To incrementally
    process each training example, the FIND-S
    algorithm requires effort linear in n and
    independent of 1/δ, 1/ε, and size(c). Therefore,
    this concept class is PAC-learnable by the FIND-S
    algorithm.

43
Interesting Results
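  • The unbiased concept class is not PAC-learnable
    because it requires an exponential number of
    examples.
  • k-term Disjunctive Normal Form (k-term DNF) is not
    PAC-learnable using H = C: its sample complexity is
    polynomial, but the computation is intractable
    unless RP = NP.
  • k-Conjunctive Normal Form (k-CNF) is a superset of
    k-term DNF, but it is PAC-learnable.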

44
Sample Complexity with Infinite Hypothesis Spaces
  • Two drawbacks to previous result
  • It often does not give a very tight bound on the
    sample complexity
  • It only applies to finite hypothesis spaces
  • Vapnik-Chervonenkis dimension of H (VC dimension)
  • Will give tighter bounds
  • Applies to many infinite hypothesis spaces.

45
Shattering a Set of Instances
  • Consider a subset of instances S from the
    instance space X.
  • Every hypothesis h imposes a dichotomy on S
  • {x ∈ S | h(x) = 1}
  • {x ∈ S | h(x) = 0}
  • Given some instance set S, there are 2^|S|
    possible dichotomies.
  • The ability of H to shatter a set of instances is
    a measure of its capacity to represent target
    concepts defined over these instances.

46
Shattering a Hypothesis Space
  • Definition: A set of instances S is shattered by
    hypothesis space H if and only if for every
    dichotomy of S there exists some hypothesis in H
    consistent with this dichotomy.

47
Vapnik-Chervonenkis Dimension
  • Ability to shatter a set of instances is closely
    related to the inductive bias of the hypothesis
    space.
  • An unbiased hypothesis space is one that shatters
    the instance space X.
  • Sometimes X cannot be shattered by H, but a large
    subset of it can.

48
Vapnik-Chervonenkis Dimension
  • Definition: The Vapnik-Chervonenkis dimension,
    VC(H), of hypothesis space H defined over instance
    space X is the size of the largest finite subset
    of X shattered by H. If arbitrarily large finite
    sets of X can be shattered by H, then VC(H) = ∞.

49
Shattered Instance Space
50
Example 1 of VC Dimension
  • Instance space X is the set of real numbers,
    X = R.
  • H is the set of intervals on the real number
    line. Form of H is
  • a < x < b
  • What is VC(H)?

51
Shattering the real number line
[Figure: real number line with points marked at -1.2, 3.4, and 6.7]
What is VC(H)? What is |H|?
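
One way to explore this question is to test by brute force whether intervals a < x < b can realise every dichotomy of a given set of distinct points. The sketch below is an illustration under that assumption; it tries only one endpoint per gap between points, which is enough because a labeling depends only on which gaps the endpoints fall in.

```python
from itertools import product

def interval_shatters(points):
    """Check whether open intervals (a, b) realise every dichotomy of distinct points."""
    pts = sorted(points)
    # Candidate endpoints: below all points, between each neighbouring pair, above all points.
    cuts = [pts[0] - 1.0] + [(p + q) / 2.0 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    # An interval lying entirely to the left of every point labels all of them negative.
    realised = {tuple(False for _ in pts)}
    for a in cuts:
        for b in cuts:
            if a < b:
                realised.add(tuple(a < x < b for x in pts))
    return all(d in realised for d in product([False, True], repeat=len(pts)))

two_points = interval_shatters([3.4, 6.7])
three_points = interval_shatters([-1.2, 3.4, 6.7])
print(two_points, three_points)
```

The three-point set fails because no single interval can include the two outer points while excluding the middle one.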
52
Example 2 of VC Dimension
  • Set X of instances corresponding to numbers on
    the x,y plane
  • H is the set of all linear decision surfaces
  • What is VC(H)?

53
Shattering the x-y plane
[Figure: shattering sets of 2 instances and 3 instances in the x-y plane]
What is VC(H)? What is |H|?
54
Proving limits on VC dimension
  • If we find any set of instances of size d that
    can be shattered, then VC(H) ≥ d.
  • To show that VC(H) < d, we must show that no set
    of size d can be shattered.

55
General result for r dimensional space
  • The VC dimension of linear decision surfaces in
    an r dimensional space is r + 1.

56
Example 3 of VC dimension
  • Instances in X are conjunctions of exactly
    three boolean literals
  • young ∧ happy ∧ single
  • H is the set of hypotheses described by a
    conjunction of up to 3 boolean literals.
  • What is VC(H)?

57
Shattering conjunctions of literals
  • Approach: construct a set of instances of size 3
    that can be shattered. Let instance i have
    positive literal l_i and all other literals
    negative. Representation of instances that are
    conjunctions of literals l_1, l_2 and l_3 as bit
    strings
  • Instance 1: 100
  • Instance 2: 010
  • Instance 3: 001
  • Construction of dichotomy: to exclude an
    instance, add the appropriate ¬l_i to the hypothesis.
  • Extend the argument to n literals.
  • Can VC(H) be greater than n (number of literals)?
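
The construction above can be checked by brute force for n = 3. This sketch enumerates all 27 conjunctions (each literal positive, negated, or absent) and confirms that every dichotomy of the three instances is matched by some hypothesis. The encoding is an assumption for illustration.

```python
from itertools import product

def satisfies(hypothesis, instance):
    """hypothesis[i] is 1 (literal l_i), 0 (its negation), or None (l_i absent)."""
    return all(want is None or bit == want for want, bit in zip(hypothesis, instance))

def shattered_by_conjunctions(instances, n):
    """Return True if conjunctions over n literals realise every dichotomy of instances."""
    hypotheses = list(product([1, 0, None], repeat=n))
    for dichotomy in product([True, False], repeat=len(instances)):
        if not any(
            all(satisfies(h, x) == label for x, label in zip(instances, dichotomy))
            for h in hypotheses
        ):
            return False
    return True

instances = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
result = shattered_by_conjunctions(instances, n=3)
print(result)
```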

58
Sample Complexity and the VC dimension
  • Can derive a new bound for the number of randomly
    drawn training examples that suffice to probably
    approximately learn a target concept (how many
    examples do we need to ε-exhaust the version
    space with probability (1 - δ)?)
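
The bound itself did not survive in this transcript. A sketch, assuming it is the one stated in Mitchell's text (attributed there to Blumer et al.), m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)):

```python
import math

def vc_sample_bound(vc_dim, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (4*log2(2/delta) + 8*VC(H)*log2(13/epsilon))."""
    return math.ceil(
        (4.0 * math.log2(2.0 / delta) + 8.0 * vc_dim * math.log2(13.0 / epsilon)) / epsilon
    )

# Assumed example: VC(H) = 3, epsilon = 0.1, delta = 0.05.
m = vc_sample_bound(vc_dim=3, epsilon=0.1, delta=0.05)
print(m)
```

Under this assumption the bound depends on VC(H) rather than |H|, so it stays finite for infinite hypothesis spaces with finite VC dimension.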

59
Comparing the Bounds
60
Lower Bound on Sample Complexity
  • Theorem 7.3 (Lower bound on sample complexity):
    Consider any concept class C such that VC(C) ≥ 2,
    any learner L, and any 0 < ε < 1/8, and 0 < δ <
    1/100. Then there exists a distribution D and
    target concept in C such that if L observes fewer
    examples than
  • $\max\left[\frac{1}{\epsilon}\log(1/\delta),\ \frac{VC(C) - 1}{32\epsilon}\right]$
  • then with probability at least δ, L outputs a
    hypothesis h having error_D(h) > ε.