1
Statistical Modelling and Computational Learning
  • John Shawe-Taylor
  • Department of Computer Science
  • UCL

2
Detecting patterns in data
  • The aim of statistical modelling and
    computational learning is to identify some stable
    pattern from a finite sample of data.
  • Based on the detected pattern, we can then
    process future data more accurately, more
    efficiently, etc.
  • The approach is useful when we are unable to
    specify the precise pattern a priori; the sample
    must be used to identify it.

3
Statistical Modelling vs Computational Learning
  • Statistical Modelling
  • Interested in inferring underlying structure, e.g.
    clusters, a parametric model, the underpinning
    network
  • Frequently uses a prior distribution and
    inference to estimate a posterior over possible
    models
  • Computational Learning
  • Interested in the ability to predict/process
    future data accurately, e.g. classifiers,
    regression, mappings
  • Aims to give assurances of performance on unseen
    data

4
Types of analysis
  • Classification: which of two or more classes does
    an example belong to? E.g. document filtering
    (the first three tasks are sketched in code after
    this list).
  • Regression: what real-valued function of the
    input matches the output values? E.g. function
    approximation.
  • Novelty detection: which new examples are unusual
    or anomalous? E.g. engine monitoring.
  • Ranking: what rank should be assigned to a given
    data item? E.g. recommender systems.
  • Clustering: natural structure in the data.
  • Networks: interactions that will persist.
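
To make the first three task types concrete, here is a minimal sketch assuming scikit-learn and synthetic data; the estimator choices (logistic regression, ridge regression, one-class SVM) are illustrative, not prescriptive:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Classification: which of two classes does an example belong to?
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_cls)
print("class:", clf.predict(X[:2]))

# Regression: what real-valued function of the input matches the outputs?
y_reg = 2 * X[:, 0] - X[:, 2] + 0.1 * rng.normal(size=100)
reg = Ridge().fit(X, y_reg)
print("value:", reg.predict(X[:2]).round(2))

# Novelty detection: which new examples are unusual or anomalous?
nov = OneClassSVM(nu=0.05).fit(X)
print("novel?", nov.predict(np.array([[5.0, 5.0, 5.0]])))  # -1 means anomalous
```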

5
Principled methods
  • The key theoretical question is whether detected
    patterns are spurious or stable: are they in this
    sample by chance, or are they implicit in the
    distribution/system generating the data?
  • This is a probabilistic or statistical question:
    estimate the stability of any patterns detected
    in the finite sample.
  • What is the chance that we have been fooled by
    the sample?

6
Uniform convergence
  • Typically we are attempting to assess an infinite
    (or very large) class of potential patterns.
  • The question is whether the finite-sample
    behaviour of a pattern will match its future
    presence.
  • For one function this needs a bound on the tail
    of the distribution, but for a class we need
    convergence uniformly over the class, cf.
    multiple hypothesis testing (see the bounds
    below).
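
A standard instance of this argument (added here for illustration), assuming a {0,1}-valued loss, a sample of size m, and true vs. empirical errors err(f) and êrr(f): Hoeffding's inequality bounds the tail for one fixed function, and a union bound then gives uniform convergence over a finite class F.

```latex
% Tail bound for a single fixed function f (Hoeffding's inequality):
\Pr\Big[\operatorname{err}(f) - \widehat{\operatorname{err}}(f) > \epsilon\Big] \le e^{-2m\epsilon^{2}}

% Uniform convergence over a finite class F (union bound):
\Pr\Big[\exists f \in F:\ \operatorname{err}(f) - \widehat{\operatorname{err}}(f) > \epsilon\Big] \le |F|\,e^{-2m\epsilon^{2}}

% Equivalently, with probability at least 1 - \delta, for all f in F:
\operatorname{err}(f) \le \widehat{\operatorname{err}}(f) + \sqrt{\frac{\ln|F| + \ln(1/\delta)}{2m}}
```

For an infinite class, ln|F| is replaced by a capacity measure such as the growth function or the VC dimension; the luckiness framework of the later slides refines this with data-dependent measures.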

7
High dimensional spaces
  • There is a further problem that modern
    applications typically involve high dimensional
    data.
  • Kernel methods also project data into high
    dimensional spaces in order to ensure that linear
    methods are powerful enough.
  • This provides a general toolkit of methods, but
    it defeats more traditional statistical methods
    of analysis: the curse of dimensionality.

8
Luckiness framework (Shawe-Taylor, Bartlett,
Williamson and Anthony, 1998)
  • This allows you to exploit evidence of an
    effective low dimensionality even when working in
    high dimensions.
  • Examples are:
  • the size of the margin in a support vector
    machine classifier,
  • the norm of the weight vector in (ridge)
    regression,
  • the sparsity of a solution,
  • the evidence in the Bayesian posterior,
  • the residual in principal components analysis.

9
Luckiness Framework
  • Wins if the distribution of examples aligns with
    the pattern
  • Hence it can work with a far richer set of
    hypotheses
  • The best-known example is the support vector
    machine, which uses the margin as a measure of
    luckiness (see the sketch below)
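
A minimal sketch of margin as luckiness, assuming scikit-learn and a separable toy problem: for the canonical hyperplane w·x + b = 0 found by a (near) hard-margin linear SVM, the geometric margin is 1/||w||, so a larger margin means a "luckier" sample.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Hard-margin linear SVM (a very large C approximates no slack).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# With |w.x_i + b| = 1 on the support vectors (canonical form),
# the geometric margin is 1 / ||w||.
w = clf.coef_.ravel()
margin = 1.0 / np.linalg.norm(w)
print(f"margin = {margin:.3f}, support vectors = {len(clf.support_)}")
```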

10
Other luckiness approaches
  • Large margins are just one way of detecting
    fortuitous distributions
  • Sparsity of representation appears to be a more
    fundamental measure
  • It gives rise to different algorithms, but the
    luckiness of large margins can also be justified
    by a sparsity argument
  • This brings us full circle back to Ockham's
    razor: parsimonious description (see the bound
    below)
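
One way to make "sparsity as luckiness" precise is the sample compression bound of Littlestone and Warmuth (a standard result, added here for illustration, not taken from the slides): if the learner's output is reconstructed from d of the m training examples and correctly classifies the remaining m − d, then with probability at least 1 − δ over the sample,

```latex
\operatorname{err}(f) \;\le\; \frac{1}{m-d}\left(\ln\binom{m}{d} + \ln\frac{1}{\delta}\right)
\;\le\; \frac{d\ln\frac{em}{d} + \ln\frac{1}{\delta}}{m-d}
```

so the sparser the solution (the smaller d), the tighter the guarantee: Ockham's razor as a theorem.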

11
Kernel-based learning
  • The idea that we look for patterns in richer sets
    of hypotheses has given rise to kernel methods
  • Data is projected into a rich feature space where
    linear patterns are sought, using an implicit
    kernel representation
  • The luckiness is typically measured by the norm
    of the linear function required for the task,
    equivalent to the margin in the case of
    classification (see the sketch below)
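
A minimal sketch of this implicit representation, using only NumPy: kernel ridge regression fits the linear-in-feature-space function f(x) = Σᵢ αᵢ k(xᵢ, x) without ever computing the features, and its norm ||f||² = αᵀKα is exactly the kind of quantity the luckiness bounds measure (the RBF kernel and the regularisation value are illustrative choices).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Toy 1-D regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

# Kernel ridge regression: alpha = (K + lambda I)^{-1} y.
lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at new points via kernel evaluations only (implicit features).
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X) @ alpha

# The norm of the fitted function, the "luckiness" the bounds use.
norm_sq = alpha @ K @ alpha
print(y_pred.round(2), f"||f||^2 = {norm_sq:.2f}")
```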

12
Modularity of kernel methods
  • Algorithms can be applied to different types of
    data simply by defining an appropriate kernel
    function to compare data items
  • The range of algorithms has been extended to
    include regression, ranking, novelty detection,
    etc.
  • Kernels can be used for statistical tests of the
    null hypothesis that two samples are drawn from
    the same distribution (a sketch follows this
    list)
  • This also enables links with statistical
    modelling techniques
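
Kernel two-sample testing is made concrete by the maximum mean discrepancy (MMD) of Gretton et al.; below is a minimal permutation-test sketch, assuming an RBF kernel and NumPy only (the gamma value and sample sizes are illustrative).

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return rbf(X, X, gamma).mean() + rbf(Y, Y, gamma).mean() \
        - 2 * rbf(X, Y, gamma).mean()

def two_sample_test(X, Y, n_perm=1000, gamma=1.0, seed=0):
    """Permutation p-value for H0: X and Y share a distribution."""
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, gamma)
    Z = np.vstack([X, Y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        Xp, Yp = Z[perm[:len(X)]], Z[perm[len(X):]]
        count += mmd2(Xp, Yp, gamma) >= observed
    return count / n_perm

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (50, 2))
Y = rng.normal(0.5, 1.0, (50, 2))   # shifted mean: H0 should be rejected
print(two_sample_test(X, Y))        # a small p-value is expected
```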

13
Kernel design methods
  • Structure kernels: use dynamic programming to
    compute sums over possible matches, e.g. string
    kernels (sketched below).
  • Kernels from probabilistic models, e.g. Fisher
    kernels.
  • Kernel PCA to perform feature selection, e.g.
    latent semantic kernels.
  • Kernel alignment/PLS: tuning features to the
    task.
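
As a concrete instance of such a dynamic programme, here is a sketch of the gap-weighted subsequence kernel of Lodhi et al. (2002): K_n(s, t) sums lam raised to the total spanned length over all pairs of matching (possibly gappy) subsequences of length n. This memoised version favours clarity over the optimised O(n|s||t|) recursion.

```python
from functools import lru_cache

def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel K_n(s, t), Lodhi et al. (2002)."""

    @lru_cache(maxsize=None)
    def k_aux(i, m, p):
        # K'_i on prefixes s[:m], t[:p]: matches counted to the end of t[:p].
        if i == 0:
            return 1.0
        if min(m, p) < i:
            return 0.0
        x = s[m - 1]
        total = lam * k_aux(i, m - 1, p)
        for j in range(1, p + 1):
            if t[j - 1] == x:
                total += k_aux(i - 1, m - 1, j - 1) * lam ** (p - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(m, p):
        # K_n on prefixes s[:m], t[:p].
        if min(m, p) < n:
            return 0.0
        x = s[m - 1]
        total = k(m - 1, p)
        for j in range(1, p + 1):
            if t[j - 1] == x:
                total += k_aux(n - 1, m - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))

print(subsequence_kernel("science", "silence", 3))
```

In practice the kernel matrix over a document collection is then normalised and fed to any kernel algorithm, exactly the modularity of the previous slide.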

14
Cross-modal analysis
  • Not limited to a single data representation.
  • Consider web images and their associated text.
  • The kernel for images includes wavelets and
    colour histograms; for text it is a bag of words.
  • This creates a content-based image retrieval
    system.

15
Conclusion
  • Statistical Modelling and Computational Learning
    aim to find patterns in data
  • SM interested in reliability of pattern, CL in
    quality of prediction
  • Using bounds to guide algorithm design can
    overcome problems with high dimensions
  • Combined with kernels, this allows linear methods
    to be used efficiently in high dimensional spaces
  • Methods provide a general toolkit for the
    practitioner making use of adaptive systems
  • Particular applications require specific tasks
    and representations/kernels that define new
    research challenges

16
On-line optimisation of web pages
  • Have to decide what content to place on a web
    page, e.g. advertising banners
  • Have some measure of response, e.g. click-through,
    purchases made, etc.
  • Can be viewed as an example of the multi-armed
    bandit problem: need to trade off exploration and
    exploitation on-line
  • Touch Clarity was a company providing this
    technology for high street banks and others

17
Upper confidence bounds
  • The basic algorithm maintains estimates of the
    response rates (RR) for each arm (piece of
    content), together with standard deviations
  • Serve the arm for which mean + S.D. is maximum:
    guaranteed either to be a good arm or to reduce
    the S.D.
  • There are provable bounds on the loss compared to
    playing the best arm all along
  • Can be generalised to the situation where
    additional information linearly determines the
    response rate (RR)
  • Kernel methods enable the handling of non-linear
    RRs (see the sketch below)
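
A minimal sketch of the idea, using the UCB1 rule of Auer et al. (2002), in which a confidence width plays the role of the S.D.; the banner-serving simulation and its click-through rates are hypothetical.

```python
import numpy as np

def ucb1(pull, n_arms, horizon):
    """UCB1: always serve the arm maximising mean + confidence width."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(horizon):
        if t < n_arms:
            arm = t                                   # initialise: play each arm once
        else:
            width = np.sqrt(2 * np.log(t) / counts)   # plays the role of the S.D.
            arm = int(np.argmax(means + width))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # running mean
    return means, counts

# Hypothetical simulation: three banners with unknown click-through
# rates; the reward is 1 if the served banner is clicked.
rates = np.array([0.05, 0.08, 0.12])
rng = np.random.default_rng(0)
means, counts = ucb1(lambda a: float(rng.random() < rates[a]), 3, 20_000)
print(counts)   # most impressions should go to the best banner (index 2)
```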