A Brief Survey of Machine Learning
  • Used materials from
  • William H. Hsu
  • Linda Jackson
  • Lex Lane
  • Tom Mitchell
  • Machine Learning, McGraw-Hill, 1997
  • Allan Moser
  • Tim Finin
  • Marie desJardins
  • Chuck Dyer

ML Lectures Outline: What Will We Discuss?
  • Why machine learning?
  • Brief Tour of Machine Learning
  • A case study
  • A taxonomy of learning
  • Intelligent systems engineering: specification of
    learning problems
  • Issues in Machine Learning
  • Design choices
  • The performance element: intelligent systems
  • Some Applications of Learning
  • Database mining, reasoning (inference/decision
    support), acting
  • Industrial usage of intelligent systems
  • Robotics

What is Learning?
  • Learning denotes changes in a system that ...
    enable a system to do the same task more
    efficiently the next time. -- Herbert Simon
  • Learning is constructing or modifying
    representations of what is being experienced. --
    Ryszard Michalski
  • Learning is making useful changes in our minds.
    -- Marvin Minsky

Why Machine Learning?
  • Discover new things or structures that are
    unknown to humans
  • Examples
  • Data mining,
  • Knowledge Discovery in Databases
  • Fill in skeletal or incomplete specifications
    about a domain
  • Large, complex AI systems cannot be completely
    derived by hand
  • They require dynamic updating to incorporate new
    information
  • Learning new characteristics
  • 1. expands the domain or expertise
  • 2. lessens the "brittleness" of the system
  • Using learning, software agents can adapt
  • to their users,
  • to other software agents,
  • to the changing environment.

Why Machine Learning?
  • New Computational Capability
  • Database mining
  • converting (technical) records into knowledge
  • Self-customizing programs
  • learning news filters,
  • adaptive monitors
  • Learning to act
  • robot planning,
  • control optimization,
  • decision support
  • Applications that are hard to program
  • automated driving,
  • speech recognition

Why Machine Learning?
  • Better Understanding of Human Learning and
    Teaching
  • Understand and improve efficiency of human
    learning
  • Use to improve methods for teaching and tutoring
  • e.g., better computer-aided instruction.
  • Cognitive science theories of knowledge
    acquisition (e.g., through practice)
  • Performance elements: reasoning (inference) and
    recommender systems
  • Time is Right
  • Recent progress in algorithms and theory
  • Rapidly growing volume of online data from
    various sources
  • Available computational power
  • Growth and interest of learning-based industries
    (e.g., data mining/KDD)

A General Model of Learning Agents
Three Aspects of Learning Systems
  • 1. Models
  • decision trees,
  • linear threshold units (winnow, weighted
    majority),
  • neural networks,
  • Bayesian networks (polytrees, belief networks,
    influence diagrams, HMMs),
  • genetic algorithms,
  • instance-based (nearest-neighbor)
  • 2. Algorithms (e.g., for decision trees)
  • ID3,
  • C4.5,
  • CART,
  • OC1
  • 3. Methodologies
  • supervised,
  • unsupervised,
  • reinforcement
  • knowledge-guided

What are the aspects of research on Learning?
  • 1. Theory of Learning
  • Computational learning theory (COLT): complexity,
    limitations of learning
  • Probably Approximately Correct (PAC) learning
  • Probabilistic, statistical, information theoretic
  • 2. Multistrategy Learning
  • Combining Techniques,
  • Knowledge Sources
  • 3. Create and collect Data
  • Time Series,
  • Very Large Databases (VLDB),
  • Text Corpora
  • 4. Select good applications
  • Performance element
  • classification,
  • decision support,
  • planning,
  • control
  • Database mining and knowledge discovery in
    databases (KDD)
  • Computer inference: learning to reason

Some Issues in Machine Learning
  • What Algorithms Can Approximate Functions
    Well? When?
  • How Do Learning System Design Factors Influence
    Accuracy?
  • Number of training examples
  • Complexity of hypothesis representation
  • How Do Learning Problem Characteristics Influence
    Accuracy?
  • Noisy data
  • Multiple data sources
  • What Are The Theoretical Limits of Learnability?
  • How Can Prior Knowledge of Learner Help?
  • What Clues Can We Get From Biological Learning
    Systems?
  • How Can Systems Alter Their Own Representation?

Major Paradigms of Machine Learning
  • Rote Learning
  • One-to-one mapping from inputs to stored
    representation
  • "Learning by memorization."
  • Association-based storage and retrieval.
  • Clustering
  • Analogy
  • Determine correspondence between two different
    representations
  • Induction
  • Use specific examples to reach general
    conclusions
  • Discovery
  • Unsupervised, specific goal not given
  • Genetic Algorithms

Major Paradigms of Machine Learning
  • Neural Networks
  • Reinforcement
  • Feedback given at end of a sequence of steps.
  • Feedback can be positive or negative reward
  • Assign reward to steps by solving the credit
    assignment problem
  • which steps should receive credit or blame for a
    final result?
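One simple heuristic for the credit assignment problem, shown here only as a sketch, is to give later steps more of the final reward than earlier ones. The `discount` factor and the function name are assumptions for illustration; the slides do not prescribe a particular scheme.

```python
def assign_credit(num_steps, final_reward, discount=0.9):
    """Give each step a share of the final reward, decaying with distance from the end.

    Step t (0-based) receives final_reward * discount ** (num_steps - 1 - t),
    so the last step gets the full reward and earlier steps get less.
    """
    return [final_reward * discount ** (num_steps - 1 - t) for t in range(num_steps)]


# Four steps followed by a single positive reward at the end.
credits = assign_credit(num_steps=4, final_reward=1.0)
print(credits)
```

A negative `final_reward` distributes blame in the same way.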

The Inductive Learning Problem
  • Induce rules that extrapolate from a given set of
    examples
  • These rules should make accurate predictions
    about future examples.
  • Supervised versus Unsupervised learning
  • Learn an unknown function f(X) = Y, where
  • X is an input example and
  • Y is the desired output.
  • Supervised learning implies we are given a
    training set of (X, Y) pairs by a "teacher."
  • Unsupervised learning means we are only given the
    Xs and some (ultimate) feedback function on our
    performance.
  • Concept learning
  • Also called Classification
  • Given a set of examples of some
    concept/class/category, determine if a given
    example is an instance of the concept or not.
  • If it is an instance, we call it a positive
    example.
  • If it is not, it is called a negative example.

Supervised Concept Learning
  • Given a training set of positive and negative
    examples of a concept
  • Usually each example has a set of
    features/attributes
  • Construct a description that will accurately
    classify whether future examples are positive or
    negative
  • That is,
  • learn some good estimate of function f
  • given a training set (x1, y1), (x2, y2), ...,
    (xn, yn)
  • where each yi is either + (positive) or -
    (negative)
  • f is a function of the features/attributes

Inductive Learning Framework
  • Raw input data from sensors are preprocessed to
    obtain a feature vector, X, that adequately
    describes all of the relevant features for
    classifying examples.
  • Each x is a list of (attribute, value) pairs. For
    example,
  • X = [Person: Sue, EyeColor: Brown, Age: Young, ...]
  • The number and names of attributes (aka features)
    are fixed (positive, finite).
  • Each attribute has a fixed, finite number of
    possible values.
  • Each example can be interpreted as a point in an
    n-dimensional feature space, where n is the
    number of attributes.

Inductive Learning by Nearest-Neighbor
  • One simple approach to inductive learning is to
    save each training example as a point in feature
    space
  • Classify a new example by giving it the same
    classification (+ or -) as its nearest neighbor
    in feature space.
  • 1. A variation involves computing a weighted sum
    of class of a set of neighbors
  • where the weights correspond to distances
  • 2. Another variation uses the center of class
  • The problem with this approach is that it doesn't
    necessarily generalize well if the examples are
    not well "clustered."
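The nearest-neighbor rule and its two variations can be sketched as follows. This assumes numeric feature vectors and Euclidean distance; the inverse-distance weighting, the value of `k`, and the toy training set are illustrative assumptions.

```python
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+)


def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query`."""
    _, label = min(train, key=lambda ex: dist(ex[0], query))
    return label


def weighted_vote(train, query, k=3):
    """Variation 1: the k nearest neighbors vote, weighted by 1/distance."""
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        votes[label] += 1.0 / (dist(point, query) + 1e-9)  # avoid division by zero
    return max(votes, key=votes.get)


def nearest_centroid(train, query):
    """Variation 2: compare the query with the center (mean) of each class."""
    groups = defaultdict(list)
    for point, label in train:
        groups[label].append(point)
    centers = {
        label: tuple(sum(coords) / len(pts) for coords in zip(*pts))
        for label, pts in groups.items()
    }
    return min(centers, key=lambda label: dist(centers[label], query))


# Toy training set: (feature vector, class) pairs.
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((5.0, 5.0), "-"), ((6.0, 5.5), "-")]
print(nearest_neighbor(train, (1.2, 1.1)))
print(weighted_vote(train, (1.2, 1.1)))
print(nearest_centroid(train, (5.5, 5.0)))
```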

Learning Decision Trees
  • Goal: Build a decision tree for classifying
    examples as positive or negative instances of a
    concept using supervised learning from a training
    set
  • A decision tree is a tree where
  • each non-leaf node is associated with an
    attribute (feature)
  • each leaf node is associated with a
    classification (+ or -)
  • each arc is associated with one of the possible
    values of the attribute at the node where the arc
    is directed from.
  • Generalization: allow for more than 2 classes
  • e.g., sell, hold, buy

Preference Bias: Ockham's Razor
  • Aka Occam's Razor, Law of Economy, or Law of
    Parsimony
  • Principle stated by William of Ockham
    (1285-1347/49), a scholastic, that
  • "non sunt multiplicanda entia praeter
    necessitatem"
  • or, entities are not to be multiplied beyond
    necessity
  • The simplest explanation that is consistent with
    all observations is the best.
  • Therefore, the smallest decision tree that
    correctly classifies all of the training examples
    is best.
  • Finding the provably smallest decision tree is
    an NP-hard problem
  • Therefore we do not construct the absolute
    smallest tree consistent with the training
    examples
  • We construct a tree that is pretty small.

Inductive Learning and Bias
  • Suppose that we want to learn a function f(x) = y
    and we are given some sample (x,y) pairs, as in
    figure (a).
  • There are several hypotheses we could make about
    this function, e.g. (b), (c) and (d).
  • A preference for one over the others reveals the
    bias of our learning technique, e.g.
  • prefer piece-wise functions
  • prefer a smooth function
  • prefer a simple function and treat outliers as
    noise
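To make the role of bias concrete, the sketch below fits two different hypotheses to the same samples. The sample points are hypothetical (the original figure is not reproduced here), and the two hypothesis classes are just two of many possible preferences.

```python
def piecewise_linear(points):
    """Hypothesis that passes through every sample, joined by straight segments."""
    pts = sorted(points)

    def h(x):
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        raise ValueError("x lies outside the sampled range")

    return h


def least_squares_line(points):
    """Hypothesis restricted to a straight line; deviations are treated as noise."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return lambda x: mean_y + slope * (x - mean_x)


# Hypothetical samples; (3, 6.0) looks like an outlier relative to the rest.
samples = [(0, 0.0), (1, 1.0), (2, 2.0), (3, 6.0), (4, 4.0)]
h_piecewise = piecewise_linear(samples)
h_line = least_squares_line(samples)
print(h_piecewise(3), h_line(3))
```

Both hypotheses come from the same data; they disagree at x = 3 only because each learner prefers a different kind of function.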

Example of using probabilities to create trees
Huffman code
  • In 1952 MIT student David Huffman devised, in the
    course of doing a homework assignment, an elegant
    coding scheme
  • This scheme is optimal in the case where all
    symbol probabilities are integral powers of
    1/2
  • A Huffman code can be built in the following
    manner
  • 1. Rank all symbols in order of probability of
    occurrence
  • 2. Successively combine the two symbols of the
    lowest probability to form a new composite
    symbol
  • Eventually we build a binary tree in which each
    node holds the total probability of all nodes
    beneath it.
  • 3. Trace a path to each leaf, noticing the
    direction at each node.
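The three steps above can be sketched with a priority queue. This is a minimal illustration, not a production encoder: it assumes symbols are not themselves tuples, and the 0-for-left, 1-for-right convention is an arbitrary choice.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a prefix code from a {symbol: probability} mapping."""
    tiebreak = count()  # unique counter so the heap never compares tree payloads
    # Step 1: the heap keeps nodes ordered by probability.
    heap = [(p, next(tiebreak), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    # Step 2: repeatedly merge the two least probable nodes into a composite.
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p0 + p1, next(tiebreak), (left, right)))
    # Step 3: trace a path to each leaf, recording 0 for left and 1 for right.
    codes = {}

    def walk(node, path):
        if isinstance(node, tuple):
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:
            codes[node] = path or "0"  # single-symbol alphabet edge case

    walk(heap[0][2], "")
    return codes


codes = huffman_code({"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5})
print(codes)
```

The most probable symbol ends up nearest the root and therefore gets the shortest code.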

Huffman code example as a prototypical idea from
another area
  Message   Probability
  A         0.125
  B         0.125
  C         0.25
  D         0.5

If we need to send many messages (A, B, C, or D) and
they have this probability distribution and we
use this code, then over time, the average
bits/message should approach 1.75 (= 0.125·3 +
0.125·3 + 0.25·2 + 0.5·1, for code lengths of 3, 3,
2, and 1 bits).
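The 1.75 figure can be checked directly. The code lengths below are the ones a Huffman code assigns to this distribution; the entropy comparison is an added illustration of why the code is optimal when every probability is a power of 1/2.

```python
from math import log2

probability = {"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5}
# Code lengths in bits: rarer messages get longer codes.
code_length = {"A": 3, "B": 3, "C": 2, "D": 1}

# Expected number of bits per message under this code.
average_bits = sum(probability[m] * code_length[m] for m in probability)

# Shannon entropy of the distribution, the lower bound on average bits/message.
entropy = -sum(p * log2(p) for p in probability.values())

print(average_bits, entropy)
```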
  • If a set T of records is partitioned into
    disjoint exhaustive classes (C1, C2, ..., Ck) on the
    basis of the value of the categorical attribute,
    then the information needed to identify the class
    of an element of T is
  • Info(T) = I(P)
  • where P is the probability distribution of the
    partition (C1, C2, ..., Ck):
  • P = (|C1|/|T|, |C2|/|T|, ..., |Ck|/|T|)
  • If we partition T w.r.t. attribute X into sets
    T1, T2, ..., Tn, then the information needed to
    identify the class of an element of T becomes the
    weighted average of the information needed to
    identify the class of an element of Ti,
  • i.e. the weighted average of Info(Ti):
  • Info(X,T) = Σ_{i=1..n} (|Ti|/|T|) · Info(Ti)
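A small sketch of both quantities follows. It assumes I(P) is the usual Shannon entropy, I(P) = -Σ p_i log2 p_i, measured in bits; the records and the attribute names `Wind` and `Play` are hypothetical.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(T) = I(P): entropy of the class distribution, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_x(records, attribute, target):
    """Info(X,T): weighted average of Info(Ti) over the subsets Ti induced by X."""
    subsets = {}
    for record in records:
        subsets.setdefault(record[attribute], []).append(record[target])
    return sum(len(s) / len(records) * info(s) for s in subsets.values())


records = [
    {"Wind": "weak", "Play": "+"},
    {"Wind": "weak", "Play": "+"},
    {"Wind": "strong", "Play": "-"},
    {"Wind": "strong", "Play": "+"},
]
print(info([r["Play"] for r in records]))
print(info_x(records, "Wind", "Play"))
```

Here the "weak" subset is pure and contributes zero information, so only the mixed "strong" subset adds to Info(X,T).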

  • Consider the quantity Gain(X,T) defined as
  • Gain(X,T) = Info(T) - Info(X,T)
  • This represents the difference between
  • the information needed to identify an element of T
    and
  • the information needed to identify an element of T
    after the value of attribute X has been obtained,
  • that is, this is the gain in information due to
    attribute X.
  • We can use this to rank attributes and to build
    decision trees where at each node is located the
    attribute with greatest gain among the attributes
    not yet considered in the path from the root.
  • The intents of this ordering are twofold
  • 1. To create small decision trees so that records
    can be identified after only a few questions.
  • 2. To match a hoped for minimality of the process
    represented by the records being considered
    (Occam's Razor).
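Greedy attribute selection by gain can be sketched as a short recursive procedure in the spirit of ID3. This is a simplified illustration under stated assumptions: categorical attributes only, no pruning, majority vote when attributes run out, and a hypothetical toy data set.

```python
from collections import Counter
from math import log2


def info(records, target):
    """Entropy (bits) of the class distribution in `records`."""
    counts = Counter(r[target] for r in records)
    return -sum(n / len(records) * log2(n / len(records)) for n in counts.values())


def gain(records, attribute, target):
    """Gain(X,T) = Info(T) - Info(X,T)."""
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * info(subset, target)
    return info(records, target) - remainder


def id3(records, attributes, target):
    """Return a class label (leaf) or (attribute, {value: subtree})."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # pure node or majority vote
    best = max(attributes, key=lambda a: gain(records, a, target))
    rest = [a for a in attributes if a != best]  # not reused further down this path
    return (best, {
        value: id3([r for r in records if r[best] == value], rest, target)
        for value in {r[best] for r in records}
    })


data = [
    {"Outlook": "sunny", "Wind": "weak", "Play": "-"},
    {"Outlook": "sunny", "Wind": "strong", "Play": "-"},
    {"Outlook": "overcast", "Wind": "weak", "Play": "+"},
    {"Outlook": "rain", "Wind": "weak", "Play": "+"},
    {"Outlook": "rain", "Wind": "strong", "Play": "-"},
]
tree = id3(data, ["Outlook", "Wind"], "Play")
print(tree)
```

On this data `Outlook` has the larger gain, so it becomes the root, and `Wind` is only needed under the "rain" branch.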

We will use this idea to build decision trees: ID3
Rule and Decision Tree Learning
  • Example: Rule Acquisition from Historical Data
  • Data
  • Patient 103 (time 1): Age: 23, First-Pregnancy:
    no, Anemia: no, Diabetes: no,
    Previous-Premature-Birth: no, Ultrasound: unknown,
    Elective C-Section: unknown,
    Emergency-C-Section: unknown
  • Patient 103 (time 2): Age: 23, First-Pregnancy:
    no, Anemia: no, Diabetes: yes,
    Previous-Premature-Birth: no, Ultrasound: abnormal,
    Elective C-Section: no,
    Emergency-C-Section: unknown
  • Patient 103 (time n): Age: 23, First-Pregnancy:
    no, Anemia: no, Diabetes: no,
    Previous-Premature-Birth: no, Ultrasound: unknown,
    Elective C-Section: no,
    Emergency-C-Section: YES
  • Learned Rule
  • IF no previous vaginal delivery, AND abnormal 2nd
    trimester ultrasound, AND malpresentation at
    admission, AND no elective C-Section, THEN
    probability of emergency C-Section is 0.6
  • Training set: 26/41 = 0.634
  • Test set: 12/20 = 0.600
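As a rough sketch, the learned rule can be read as a predicate over a patient record. The field names are hypothetical, and reading 26/41 as "26 of the 41 training patients matching the rule had an emergency C-section" is an assumption about how the fractions were obtained.

```python
def rule_fires(patient):
    """True when all four conditions of the learned rule hold (hypothetical field names)."""
    return (not patient["previous_vaginal_delivery"]
            and patient["ultrasound_2nd_trimester"] == "abnormal"
            and patient["malpresentation_at_admission"]
            and not patient["elective_c_section"])


def rule_precision(matched, positive):
    """Fraction of rule-matching patients who had an emergency C-section."""
    return positive / matched


training_estimate = rule_precision(matched=41, positive=26)
test_estimate = rule_precision(matched=20, positive=12)
print(round(training_estimate, 3), round(test_estimate, 3))
```

The close agreement between the training and test estimates is what suggests the rule generalizes rather than merely fitting the training records.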

Neural Network Learning
  • Autonomous Land Vehicle In a Neural Network
    (ALVINN), Pomerleau et al.
  • http://www.cs.cmu.edu/afs/cs/project/alv/member/ww
  • Drives 70 mph on highways

Specifying A Learning Problem
  • Learning = Improving with Experience at Some Task
  • Improve over task T,
  • with respect to performance measure P,
  • based on experience E.
  • Example: Learning to Play Checkers
  • T: play games of checkers
  • P: percent of games won in world tournament
  • E: opportunity to play against self
  • Refining the Problem Specification: Issues
  • What experience?
  • What exactly should be learned?
  • How shall it be represented?
  • What specific algorithm to learn it?
  • Defining the Problem Milieu
  • Performance element
  • How shall the results of learning be applied?
  • How shall the performance element be evaluated?
    The learning system?

Example: Learning to Play Checkers
A Target Function for Learning to Play Checkers
A Training Procedure for Learning to Play Checkers
  • Obtaining Training Examples
  • V(b): the target function
  • V̂(b): the learned function
  • V_train(b): the training value
  • One Rule For Estimating Training Values:
    V_train(b) ← V̂(Successor(b))
  • Choose Weight Tuning Rule
  • Least Mean Square (LMS) weight update
    rule: REPEAT
  • Select a training example b at random
  • Compute the error(b) for this training
    example: error(b) = V_train(b) - V̂(b)
  • For each board feature fi, update weight wi as
    follows, where c is a small, constant
    factor to adjust the learning rate:
    wi ← wi + c · fi · error(b)
Design Choices for Learning to Play Checkers
Completed Design
Example of Interesting Application: Data Mining
Example: Reasoning (Inference, Decision Support)
Example: Planning and Control
Relevant Disciplines
  • Artificial Intelligence
  • Bayesian Methods
  • Cognitive Science
  • Computational Complexity Theory
  • Control Theory
  • Information Theory
  • Neuroscience
  • Philosophy
  • Psychology
  • Statistics

Contributions of these disciplines to Machine Learning
  • Optimization, Learning Predictors, Meta-Learning
  • Entropy Measures, MDL Approaches, Optimal Codes
  • PAC Formalism, Mistake Bounds
  • Language Learning, Learning to Reason
  • Bayes's Theorem, Missing Data Estimators
  • Symbolic Representation, Planning/Problem
    Solving, Knowledge-Guided Learning
  • Bias/Variance Formalism, Confidence
    Intervals, Hypothesis Testing
  • ANN Models, Modular Learning
  • Occam's Razor, Inductive Generalization
  • Power Law of Practice, Heuristic Learning
