Title: A Brief Tour of Machine Learning
1. A Brief Tour of Machine Learning
2. What is Machine Learning?
- A very multidisciplinary field: statistics, mathematics, artificial intelligence, psychology, philosophy, cognitive science
- In a nutshell: developing algorithms that learn from data
- Historically flourished from advances in computing in the early 60s, with a resurgence in the late 90s
3. Main areas in Machine Learning
1. Supervised learning: assumes a teacher exists to label/annotate the data
2. Unsupervised learning: no need for a teacher; try to learn relationships automatically
3. Reinforcement learning: biologically plausible; try to learn from reward/punishment stimuli/feedback
4. Supervised Learning
5. More about Supervised Learning
- Perhaps the most well-studied area of machine learning: lots of nice theory adapted from statistics/mathematics
- Assumes the existence of a training set and a test set
- The main sub-areas of research are:
  - Pattern recognition (discrete labels)
  - Regression (continuous labels)
  - Time series analysis (temporal dependence in the data)
- The i.i.d. assumption is commonly made
6. The formalisation of data
- How do we formally describe our data?
- Label: the property of the object that we want to predict in the future using our training data, e.g. in cancer screening the labels Y could be normal, benign, or malignant
- Object: commonly represented as a feature vector that describes the object; the individual features can be real, discrete, or symbolic, e.g. patient symptoms such as temperature, sex, eye colour
7. The formalisation of data (continued)
- What is training and test data?
[Figure: a training set of labelled digit images (object-label pairs), alongside new test images whose labels are either not known or withheld from the learner]
We learn from the training data and try to predict new, unseen test data. More formally, we have a set of n training and test examples (object-label information pairs) drawn from some unknown probability distribution P(X, Y).
8. More about Pattern Recognition
- Lots of algorithms/techniques; the main contenders:
- Support Vector Machines (SVM)
- Nearest Neighbours
- Decision Trees
- Neural Networks
- Multivariate Statistics
- Bayesian algorithms
- Logic programming
9. The mighty SVM algorithm
- A very popular technique with lots of followers; relatively new
- A very simple technique related to the Perceptron; it is a linear classifier (it separates the data into half spaces)
- Concept: keep the classifier simple and don't overfit the data, so that the classifier generalises well on new test data (Occam's razor)
- Concept: if the data is not linearly separable, use a kernel, i.e. a map Φ into another, higher dimensional feature space in which the data may be separable (see the sketch after this list)
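As an illustration of the two concepts (assuming scikit-learn is available; the slides do not mention any library), a linear and a kernelised SVM differ only in the kernel argument:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two-class toy data: class 0 inside a disc, class 1 in a ring around it
# (not linearly separable in the original feature space).
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

linear_svm = SVC(kernel="linear").fit(X, y)  # separates half spaces only
rbf_svm = SVC(kernel="rbf").fit(X, y)        # implicit map into a richer feature space

print("linear accuracy:", linear_svm.score(X, y))
print("RBF accuracy:   ", rbf_svm.score(X, y))
```

On this ring-shaped data the linear classifier can do little better than predicting the majority class, while the RBF kernel's implicit feature map makes the classes separable.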
10. Hot topics in SVMs
- Kernel design is central to applying SVMs to data: e.g. when the objects are text documents and the features are words, the kernel can incorporate domain knowledge about grammar
- Applying the kernel technique to other learning algorithms, e.g. Neural Networks
11. The trusty old Nearest Neighbour algorithm
- Born in the 60s; probably the simplest of all the algorithms to understand
- Decision rule: classify a new test example by finding the closest neighbouring example in the training set and predicting the same label as that closest example (sketched below)
- Lots of theory justifying its convergence properties
- A very lazy technique, and not very fast: it has to search the whole training set for each test example
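A minimal sketch of the decision rule (the function name is mine; Euclidean distance is assumed, as on the next slide):

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_new):
    """Predict the label of x_new as the label of its closest
    training example under Euclidean distance."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(nearest_neighbour_predict(X_train, y_train, np.array([4.0, 4.5])))  # -> 1
```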
12. Problems with Nearest Neighbours
- Examples are viewed in Euclidean space, so the method can be very sensitive to feature scaling
- Finding computationally efficient ways to search for the nearest neighbouring example
13. Decision Trees
- Many different varieties: C4.5, CART, ID3
- The algorithms build classification rules using a tree of if-then statements
- The tree is constructed using Minimum Description Length (MDL) principles (it tries to make the tree as simple as possible); for example:
IF temperature > 65 → patient has fever
    IF dehydrated = yes → patient has flu
    ELSE → patient has pneumonia
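A sketch of fitting such a tree in code (scikit-learn's CART-style learner is my choice, not the slides'; the toy data follows the rule above):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy patient data: [temperature, dehydrated (1 = yes, 0 = no)].
X = [[70, 1], [68, 0], [72, 1], [60, 0], [62, 1]]
y = ["flu", "pneumonia", "flu", "no fever", "no fever"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["temperature", "dehydrated"]))
```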
14. Benefits/Issues with Decision Trees
- Instability: minor changes to the training data can make huge changes to the decision tree
- The user can visualise/interpret the hypothesis directly, and can find interesting classification rules
- Problems with continuous real attributes, which must be discretised
- A large AI following, and widely used in industry
15. Mystical Neural Networks
- Very flexible; learning is a gradient descent process (back-propagation)
- Training neural networks involves a lot of design choices:
  - what network structure, and how many hidden layers
  - how to encode the data (values must lie in [0, 1])
  - use momentum to speed up convergence
  - use weight decay to keep the network simple
16. Training a neural network
The learnt hypothesis is represented by the weights that interconnect the neurons. The aim in training the neural network is to find the weight vector w that minimises the error E(w) on the training set. Each unit applies a sigmoid function, σ(a) = 1/(1 + e^(−a)), so training reduces to a gradient descent problem: repeatedly update w ← w − η∇E(w).
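A minimal sketch of this loop for a single sigmoid neuron (squared error is my assumption; the slides do not specify the loss):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy training set: two inputs per example, targets in {0, 1}.
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.1], [0.2, 0.8]])
y = np.array([1.0, 0.0, 0.0, 1.0])

w = np.zeros(2)   # the hypothesis: one weight per input
eta = 0.5         # learning rate

for _ in range(1000):
    out = sigmoid(X @ w)                    # network output for every example
    err = out - y                           # dE/d(out) for E(w) = 0.5 * sum((out - y)^2)
    grad = X.T @ (err * out * (1.0 - out))  # chain rule through the sigmoid
    w -= eta * grad                         # the gradient descent step

print(w, sigmoid(X @ w).round(2))
```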
17. Interesting applications
- Bioinformatics
- genetic/protein code analysis
- microarray analysis
- gene regulatory pathways
- WWW
- classifying text/html documents
- filtering images
- filtering emails
18. Bayesian Algorithms
- Try to model the interrelationships between variables probabilistically
- Can model expert/domain knowledge directly in the classifier, as prior belief in certain events
- Use the basic axioms of probability theory to extract probabilistic estimates (a worked example follows)
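Concretely, such estimates come from Bayes' rule, P(Y|X) = P(X|Y)P(Y) / P(X). A worked toy example with invented screening numbers:

```python
# Prior belief: 1% of screened patients have the disease.
p_disease = 0.01

# Likelihoods: test sensitivity and false positive rate (illustrative numbers).
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Bayes' rule: P(disease | positive test).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # ~0.161: a positive test still leaves the disease unlikely
```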
19. Bayesian algorithms in practice
- Lots of different algorithms: the Relevance Vector Machine (RVM), Naïve Bayes, Simple Bayes, Bayesian Belief Networks (BBN)
- Has a large following, especially at Microsoft Research
[Figure: a Bayesian belief network in which causal links between features can be modelled, e.g. Weather (sunny), Temperature, and Humidity (100) influencing the decisions Play Tennis / Play Monopoly]
20. Issues with Bayesian algorithms
- Tractability: to find solutions one needs numerical approximations or computational shortcuts
- Can model causal relationships between variables
- Need lots of data to estimate the probabilities from observed training data frequencies
21. Very important side problems
- Feature selection/extraction: using Principal Component Analysis, Wavelets, Canonical Correlation, Factor Analysis, Independent Component Analysis (a PCA sketch follows this list)
- Imputation: what to do with missing features?
- Visualisation: making the hypothesis human readable/interpretable
- Meta-learning: how to add functionality to existing algorithms, or combine the predictions of many classifiers (Boosting, Bagging, Confidence and Probability Machines)
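As one example from this list, a minimal PCA feature-extraction sketch (pure numpy; my own illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 objects, 5 features
X = X - X.mean(axis=0)                  # centre the data

# Principal components are the eigenvectors of the covariance matrix.
cov = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Keep the top 2 components and project: 5 features -> 2 features.
top2 = eigvecs[:, -2:]
X_reduced = X @ top2
print(X_reduced.shape)  # (100, 2)
```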
22. Very important side problems (continued)
- How to incorporate domain knowledge into a learner
- The trade-off between complexity (accuracy on the training set) vs. generalisation (accuracy on the test set)
- Pre-processing of the data: normalising, standardising, discretising
- How to test: leave-one-out, cross-validation, stratification, online, offline (a cross-validation sketch follows)
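A minimal k-fold cross-validation loop written from scratch (the fit_and_score parameter is a hypothetical stand-in for any of the classifiers above, returning test accuracy):

```python
import numpy as np

def cross_validate(X, y, k, fit_and_score):
    """k-fold cross-validation: train on k-1 folds, test on the
    held-out fold, and average the k test scores."""
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(fit_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))
```

With k equal to the number of examples this becomes the leave-one-out estimate mentioned above.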
23. Unsupervised Learning
- Learning without a teacher
24. An introduction to Unsupervised Learning
- No need for a teacher/supervisor
- Mainly clustering: trying to group objects into sensible clusters
- Novelty detection: finding strange examples in the data
[Figure: examples of clustering and of novelty detection]
25. Algorithms available
- For clustering: the EM algorithm, K-Means (sketched below), Self-Organising Maps (SOM)
- For novelty detection: 1-class SVM, support vector regression, Neural Networks
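A compact K-Means sketch in numpy (illustrative; the slides name the algorithm but give no code):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain K-Means: alternately assign points to their nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids
```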
26. Issues and Applications
- Very useful for extracting information from data
- Used in medicine to identify disease subtypes
- Used to cluster web documents automatically
- Used to identify customer target groups in business
- Not much publicly available data to test algorithms with
27. Reinforcement Learning
- Learning inspired by nature
28. An introduction
- The most biologically plausible: feedback is given through reward/punishment stimuli
- A field with a lot of theory, still in need of real-life applications (other than playing backgammon)
- But it also encompasses the large field of Evolutionary Computing
- Applications are more open ended
- Getting closer to what the public consider AI
29. Traditional Reinforcement Learning
- The techniques use dynamic programming to search for an optimal strategy
- The algorithms search to maximise their reward
- Q-Learning (Chris Watkins, next door) is the most well-known technique (its update rule is sketched below)
- The only successful applications so far are to games and toy problems
- A lack of real-life applications
- Very few researchers in this field
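The core of Q-Learning is a single tabular update; a minimal sketch (the state/action sets and reward values are hypothetical placeholders):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9  # learning rate, discount factor

def q_update(s, a, reward, s_next):
    """Watkins' Q-Learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])

q_update(s=0, a=1, reward=1.0, s_next=2)
print(Q[0])
```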
30. Evolutionary Computing
- Inspired by the process of biological evolution
- Essentially an optimisation technique: the problem is encoded as a chromosome
- We find new/better solutions to the problem by sexual reproduction (crossover) and mutation; mutation encourages the exploration of new candidate solutions (a minimal sketch follows)
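A tiny genetic algorithm sketch, maximising the number of 1-bits in a chromosome (the "one-max" toy problem; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
POP, GENES, GENERATIONS, MUT_RATE = 20, 16, 50, 0.02

def fitness(pop):
    return pop.sum(axis=1)  # one-max: count the 1-bits

pop = rng.integers(0, 2, size=(POP, GENES))
for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population as parents.
    parents = pop[np.argsort(fitness(pop))][POP // 2:]
    # Crossover: splice two random parents at a random cut point.
    cuts = rng.integers(1, GENES, size=POP)
    mums = parents[rng.integers(0, len(parents), size=POP)]
    dads = parents[rng.integers(0, len(parents), size=POP)]
    children = np.where(np.arange(GENES) < cuts[:, None], mums, dads)
    # Mutation: flip each bit with small probability.
    flips = rng.random(children.shape) < MUT_RATE
    pop = np.where(flips, 1 - children, children)

print(fitness(pop).max(), "of", GENES)
```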
31. Techniques available in Evolutionary Computing
- Lower level optimisers:
  - Evolutionary Programming, Evolutionary Algorithms
  - Genetic Programming, Genetic Algorithms
  - Evolutionary Strategy
  - Simulated Annealing
- Higher level optimisers:
  - Tabu search
  - Multi-objective optimisation
[Figure: a Pareto front of optimal solutions plotted in the Objective 1 vs Objective 2 plane. Which one should we pick?]
32. Issues in Evolutionary Computing
- How to encode the problem is very important
- Setting mutation/crossover rates is very ad hoc
- Very computationally/memory intensive
- Not much theory can be developed, so it is frowned upon by machine learning theorists