Title: A Brief Tour of Machine Learning
1. A Brief Tour of Machine Learning
2. What is Machine Learning?
- A very multidisciplinary field: statistics, mathematics, artificial intelligence, psychology, philosophy, cognitive science
- In a nutshell: developing algorithms that learn from data
- Historically flourished from advances in computing in the early 60s, with a resurgence in the late 90s
3. Main areas in Machine Learning
1. Supervised learning: assumes a teacher exists to label/annotate the data
2. Unsupervised learning: no need for a teacher; try to learn relationships automatically
3. Reinforcement learning: biologically plausible; try to learn from reward/punishment stimuli/feedback
4. Supervised Learning
5. More about Supervised Learning
- Perhaps the most well-studied area of machine learning: lots of nice theory adapted from statistics/mathematics
- Assumes the existence of a training set and a test set
- The main sub-areas of research are:
  - Pattern recognition (discrete labels)
  - Regression (continuous labels)
  - Time series analysis (temporal dependence in the data)
- The i.i.d. assumption is commonly made
6. The formalisation of data
- How do we formally describe our data?
- Label: the property of the object that we want to predict in the future using our training data, e.g. in cancer screening the labels Y could be normal, benign, or malignant
- Object: commonly represented as a feature vector that describes the object; the individual features can be real, discrete, or symbolic, e.g. patient symptoms such as temperature, sex, eye colour
7. The formalisation of data (continued)
- What is training and test data?
[Figure: a training set of labelled digit images (object-label pairs), alongside new test images whose labels are either not known or withheld from the learner]
We learn from the training data and try to predict new, unseen test data. More formally, we have a set of n training and test examples (object-label information pairs) drawn from some unknown probability distribution P(X, Y).
8. More about Pattern Recognition
- Lots of algorithms/techniques; the main contenders:
- Support Vector Machines (SVM)
- Nearest Neighbours
- Decision Trees
- Neural Networks
- Multivariate Statistics
- Bayesian algorithms
- Logic programming
9. The mighty SVM algorithm
- A very popular technique with lots of followers; relatively new
- A very simple technique related to the Perceptron; it is a linear classifier (it separates the data into half spaces)
- Concept: keep the classifier simple and don't overfit the data, so that the classifier generalises well on new test data (Occam's razor)
- Concept: if the data is not linearly separable, use a kernel, i.e. a map Φ into another, higher dimensional feature space in which the data may be separable (see the sketch after this list)
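As an illustration of the two concepts (assuming scikit-learn is available; the slides do not mention any library), a linear and a kernelised SVM differ only in the kernel argument:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two-class toy data: class 0 inside a disc, class 1 in a ring around it
# (not linearly separable in the original feature space).
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

linear_svm = SVC(kernel="linear").fit(X, y)  # separates half spaces only
rbf_svm = SVC(kernel="rbf").fit(X, y)        # implicit map into a richer feature space

print("linear accuracy:", linear_svm.score(X, y))
print("RBF accuracy:   ", rbf_svm.score(X, y))
```

On this ring-shaped data the linear classifier can do little better than predicting the majority class, while the RBF kernel's implicit feature map makes the classes separable.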
10. Hot topics in SVMs
- Kernel design is central to applying SVMs to data: e.g. when the objects are text documents and the features are words, the kernel can incorporate domain knowledge about grammar
- Applying the kernel technique to other learning algorithms, e.g. Neural Networks
11. The trusty old Nearest Neighbour algorithm
- Born in the 60s; probably the simplest of all the algorithms to understand
- Decision rule: classify a new test example by finding the closest neighbouring example in the training set and predicting the same label as that closest example (sketched below)
- Lots of theory justifying its convergence properties
- A very lazy technique, and not very fast: it has to search the whole training set for each test example
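A minimal sketch of the decision rule (the function name is mine; Euclidean distance is assumed, as on the next slide):

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_new):
    """Predict the label of x_new as the label of its closest
    training example under Euclidean distance."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(nearest_neighbour_predict(X_train, y_train, np.array([4.0, 4.5])))  # -> 1
```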
12. Problems with Nearest Neighbours
- Examples are viewed in Euclidean space, so the method can be very sensitive to feature scaling
- Finding computationally efficient ways to search for the nearest neighbouring example
13. Decision Trees
- Many different varieties: C4.5, CART, ID3
- The algorithms build classification rules using a tree of if-then statements
- The tree is constructed using Minimum Description Length (MDL) principles (it tries to make the tree as simple as possible); for example:
IF temperature > 65 → patient has fever
    IF dehydrated = yes → patient has flu
    ELSE → patient has pneumonia
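A sketch of fitting such a tree in code (scikit-learn's CART-style learner is my choice, not the slides'; the toy data follows the rule above):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy patient data: [temperature, dehydrated (1 = yes, 0 = no)].
X = [[70, 1], [68, 0], [72, 1], [60, 0], [62, 1]]
y = ["flu", "pneumonia", "flu", "no fever", "no fever"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["temperature", "dehydrated"]))
```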
14. Benefits/Issues with Decision Trees
- Instability: minor changes to the training data can make huge changes to the decision tree
- The user can visualise/interpret the hypothesis directly, and can find interesting classification rules
- Problems with continuous real attributes, which must be discretised
- A large AI following, and widely used in industry
15. Mystical Neural Networks
- Very flexible; learning is a gradient descent process (back-propagation)
- Training neural networks involves a lot of design choices:
  - what network structure, and how many hidden layers
  - how to encode the data (values must lie in [0, 1])
  - use momentum to speed up convergence
  - use weight decay to keep the network simple
16. Training a neural network
The learnt hypothesis is represented by the weights that interconnect the neurons. The aim in training the neural network is to find the weight vector w that minimises the error E(w) on the training set. Each unit applies a sigmoid function, σ(a) = 1/(1 + e^(−a)), so training reduces to a gradient descent problem: repeatedly update w ← w − η∇E(w).
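A minimal sketch of this loop for a single sigmoid neuron (squared error is my assumption; the slides do not specify the loss):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy training set: two inputs per example, targets in {0, 1}.
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.1], [0.2, 0.8]])
y = np.array([1.0, 0.0, 0.0, 1.0])

w = np.zeros(2)   # the hypothesis: one weight per input
eta = 0.5         # learning rate

for _ in range(1000):
    out = sigmoid(X @ w)                    # network output for every example
    err = out - y                           # dE/d(out) for E(w) = 0.5 * sum((out - y)^2)
    grad = X.T @ (err * out * (1.0 - out))  # chain rule through the sigmoid
    w -= eta * grad                         # the gradient descent step

print(w, sigmoid(X @ w).round(2))
```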
17. Interesting applications
- Bioinformatics
- genetic/protein code analysis
- microarray analysis
- gene regulatory pathways
- WWW
- classifying text/html documents
- filtering images
- filtering emails
18. Bayesian Algorithms
- Try to model the interrelationships between variables probabilistically
- Can model expert/domain knowledge directly in the classifier, as prior belief in certain events
- Use the basic axioms of probability theory to extract probabilistic estimates (a worked example follows)
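Concretely, such estimates come from Bayes' rule, P(Y|X) = P(X|Y)P(Y) / P(X). A worked toy example with invented screening numbers:

```python
# Prior belief: 1% of screened patients have the disease.
p_disease = 0.01

# Likelihoods: test sensitivity and false positive rate (illustrative numbers).
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Bayes' rule: P(disease | positive test).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # ~0.161: a positive test still leaves the disease unlikely
```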
19. Bayesian algorithms in practice
- Lots of different algorithms: the Relevance Vector Machine (RVM), Naïve Bayes, Simple Bayes, Bayesian Belief Networks (BBN)
- Has a large following, especially at Microsoft Research
[Figure: a Bayesian belief network in which causal links between features can be modelled, e.g. Weather (sunny), Temperature, and Humidity (100) influencing the decisions Play Tennis / Play Monopoly]
20. Issues with Bayesian algorithms
- Tractability: to find solutions one needs numerical approximations or computational shortcuts
- Can model causal relationships between variables
- Need lots of data to estimate the probabilities from observed training data frequencies
21. Very important side problems
- Feature selection/extraction: using Principal Component Analysis, Wavelets, Canonical Correlation, Factor Analysis, Independent Component Analysis (a PCA sketch follows this list)
- Imputation: what to do with missing features?
- Visualisation: making the hypothesis human readable/interpretable
- Meta-learning: how to add functionality to existing algorithms, or combine the predictions of many classifiers (Boosting, Bagging, Confidence and Probability Machines)
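As one example from this list, a minimal PCA feature-extraction sketch (pure numpy; my own illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 objects, 5 features
X = X - X.mean(axis=0)                  # centre the data

# Principal components are the eigenvectors of the covariance matrix.
cov = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Keep the top 2 components and project: 5 features -> 2 features.
top2 = eigvecs[:, -2:]
X_reduced = X @ top2
print(X_reduced.shape)  # (100, 2)
```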
22. Very important side problems (continued)
- How to incorporate domain knowledge into a learner
- The trade-off between complexity (accuracy on the training set) vs. generalisation (accuracy on the test set)
- Pre-processing of the data: normalising, standardising, discretising
- How to test: leave-one-out, cross-validation, stratification, online, offline (a cross-validation sketch follows)
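A minimal k-fold cross-validation loop written from scratch (the fit_and_score parameter is a hypothetical stand-in for any of the classifiers above, returning test accuracy):

```python
import numpy as np

def cross_validate(X, y, k, fit_and_score):
    """k-fold cross-validation: train on k-1 folds, test on the
    held-out fold, and average the k test scores."""
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(fit_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))
```

With k equal to the number of examples this becomes the leave-one-out estimate mentioned above.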
23. Unsupervised Learning
- Learning without a teacher
24. An introduction to Unsupervised Learning
- No need for a teacher/supervisor
- Mainly clustering: trying to group objects into sensible clusters
- Novelty detection: finding strange examples in the data
[Figure: examples of clustering and of novelty detection]
25. Algorithms available
- For clustering: the EM algorithm, K-Means (sketched below), Self-Organising Maps (SOM)
- For novelty detection: 1-class SVM, support vector regression, Neural Networks
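A compact K-Means sketch in numpy (illustrative; the slides name the algorithm but give no code):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain K-Means: alternately assign points to their nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids
```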
26. Issues and Applications
- Very useful for extracting information from data
- Used in medicine to identify disease subtypes
- Used to cluster web documents automatically
- Used to identify customer target groups in business
- Not much publicly available data to test algorithms with
27. Reinforcement Learning
- Learning inspired by nature
28. An introduction
- The most biologically plausible: feedback is given through reward/punishment stimuli
- A field with a lot of theory, still in need of real-life applications (other than playing backgammon)
- But it also encompasses the large field of Evolutionary Computing
- Applications are more open ended
- Getting closer to what the public consider AI
29. Traditional Reinforcement Learning
- The techniques use dynamic programming to search for an optimal strategy
- The algorithms search to maximise their reward
- Q-Learning (Chris Watkins, next door) is the most well-known technique (its update rule is sketched below)
- The only successful applications so far are to games and toy problems
- A lack of real-life applications
- Very few researchers in this field
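The core of Q-Learning is a single tabular update; a minimal sketch (the state/action sets and reward values are hypothetical placeholders):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9  # learning rate, discount factor

def q_update(s, a, reward, s_next):
    """Watkins' Q-Learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])

q_update(s=0, a=1, reward=1.0, s_next=2)
print(Q[0])
```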
30. Evolutionary Computing
- Inspired by the process of biological evolution
- Essentially an optimisation technique: the problem is encoded as a chromosome
- We find new/better solutions to the problem by sexual reproduction (crossover) and mutation; mutation encourages the exploration of new candidate solutions (a minimal sketch follows)
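A tiny genetic algorithm sketch, maximising the number of 1-bits in a chromosome (the "one-max" toy problem; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
POP, GENES, GENERATIONS, MUT_RATE = 20, 16, 50, 0.02

def fitness(pop):
    return pop.sum(axis=1)  # one-max: count the 1-bits

pop = rng.integers(0, 2, size=(POP, GENES))
for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population as parents.
    parents = pop[np.argsort(fitness(pop))][POP // 2:]
    # Crossover: splice two random parents at a random cut point.
    cuts = rng.integers(1, GENES, size=POP)
    mums = parents[rng.integers(0, len(parents), size=POP)]
    dads = parents[rng.integers(0, len(parents), size=POP)]
    children = np.where(np.arange(GENES) < cuts[:, None], mums, dads)
    # Mutation: flip each bit with small probability.
    flips = rng.random(children.shape) < MUT_RATE
    pop = np.where(flips, 1 - children, children)

print(fitness(pop).max(), "of", GENES)
```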
31. Techniques available in Evolutionary Computing
- Lower level optimisers:
  - Evolutionary Programming, Evolutionary Algorithms
  - Genetic Programming, Genetic Algorithms
  - Evolutionary Strategy
  - Simulated Annealing
- Higher level optimisers:
  - Tabu search
  - Multi-objective optimisation
[Figure: a Pareto front of optimal solutions plotted in the Objective 1 vs Objective 2 plane. Which one should we pick?]
32. Issues in Evolutionary Computing
- How to encode the problem is very important
- Setting mutation/crossover rates is very ad hoc
- Very computationally/memory intensive
- Not much theory can be developed, so it is frowned upon by machine learning theorists