Title: Machine Learning for Data Mining
1 Machine Learning for Data Mining
- Ata Kaban
- The University of Birmingham
2 Overview
- Branches of machine learning
- Roadmap of unsupervised learning
- Some example applications from my research work
3 Machine learning
- Data y1, y2, ... (e.g. sensory inputs)
- Supervised learning
- The machine is given desired outputs z1, z2, ... attached to the input data
- Goal: learn to produce the correct output for a new input
- Reinforcement learning
- The machine can also produce actions, which affect the state of the environment (the data), and receives rewards / punishments
- Goal: learn to act in a way that maximises rewards in the long term
- Unsupervised learning
- Goal: build a model of the data, which can be used for reasoning, explanation, decision making, prediction, etc., or to make other tasks easier
4 Overview
- Branches of machine learning
- Roadmap of unsupervised learning
- Some example applications from my research work
5 Unsupervised learning
- Goals: finding useful representations
- Finding clusters
- Dimensionality reduction
- Building topographic maps
- Finding hidden causes that explain the data
- Modelling the data density
- Uses
- Data compression
- Outlier detection
- Make other learning tasks easier
- A theory of human learning / perception
6 More examples of problems
- A more appropriate model should consider some conceptual dimensions instead of words (Gardenfors)
- Finding topics, meanings, intentions
- The two-choice questionnaire problem
- Word saliency
7 Latent variable models: a useful formalism
- Hidden (latent) variables
- Topics
- Intentions
- Relative importance
- Observed variables
- Stream of words (with grammar / syntax)
- Assuming that there is a systematic relationship between these two categories of variables, we try to find out the hidden variables of interest from the observed variables.
8 Modelling and inference
- Modelling: specify how the hidden variables of interest might have generated the observed variables (a stochastic mapping from hidden causes to data)
- Inference: recover the hidden variables of interest by inverting that stochastic mapping, going from the observed data back to the hidden causes
[Diagram: hidden causes --(stochastic mapping; modelling)--> data; inference runs in the opposite direction]
9 Probabilistic latent variable models
- Ingredients: a latent variable, the observed data, noise, and parameters; the observed data are generated from the latent variable through a parameterised mapping plus noise
- Linear models with Gaussian latent prior: FA, PPCA, PCA
- Finite mixture models
- Linear models with non-Gaussian latent prior: IFA, ICA, PP
- Non-linear models: locally linear, or explicitly non-linear (GTM)
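A hedged illustration (the notation here is mine, not taken from the slides): the linear members of this family share the common generative form below, with the choice of latent prior p(x) and noise model p(n) distinguishing FA, PPCA/PCA, the mixture models, IFA and ICA.

    \[
      \mathbf{y} \;=\; W\,\mathbf{x} + \mathbf{n},
      \qquad \mathbf{x} \sim p(\mathbf{x}),
      \qquad \mathbf{n} \sim p(\mathbf{n})
    \]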
10 What are those acronyms?
- These are some classical and quite useful techniques for data analysis
- FA: Factor Analysis
- PCA: Principal Component Analysis
- PPCA: Probabilistic Principal Component Analysis
- MoG: Mixture of Gaussians
- IFA: Independent Factor Analysis
- ICA: Independent Component Analysis
- GTM: Generative Topographic Mapping
11 Applications of latent variable models
- Tools for discovering latent structure in the observable data
- Data analysis, visualisation
- Application domains:
- Data mining
- Telecommunications
- Bioinformatics
- Fraud detection
- Information retrieval
- Marketing analysis
12 Three basic types of latent variable models: the intuition
13 Linear latent variable models
- I. Models with Gaussian latent prior, p(x) = N(0, I)
- Factor Analysis (FA): n ~ N(0, Σ), Σ diagonal
- Probabilistic Principal Components Analysis (PPCA): n ~ N(0, σ²I); PCA is the noise-free limit, n ~ δ(0)
- Typical uses: compression, visualisation
- II. Mixture models, p(x) = Σ_k δ(x − x_k) p(x_k)
- Mixture of Gaussians (MoG): n ~ N(0, Σ) or N(0, σ²I)
- Typical uses: clustering
- III. Models with non-Gaussian (e.g. sparse) latent priors, p(x) non-Gaussian
- Independent Factor Analysis (IFA): n ~ N(0, Σ), p(x) = Σ_k N_x(0, σ_k) p(k)
- Independent Component Analysis (ICA): n ~ δ(0), p(x) non-Gaussian, e.g. Laplace
- Typical uses: signal separation, structure discovery, visualisation, clustering
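A minimal generative sketch of the three model classes above (NumPy assumed; the dimensions, the loading matrix W and the noise levels are illustrative choices, not values from the talk):

    import numpy as np
    rng = np.random.default_rng(0)
    n, d, k = 1000, 10, 3
    W = rng.normal(size=(d, k))                       # loading / mixing matrix (the parameters)

    # I. Gaussian latent prior (FA / PPCA / PCA): x ~ N(0, I), y = W x + n
    x = rng.normal(size=(n, k))
    y_ppca = x @ W.T + 0.1 * rng.normal(size=(n, d))  # isotropic noise gives PPCA-style data

    # II. Mixture model: the latent variable is a discrete component label
    centres = rng.normal(scale=3.0, size=(k, d))
    labels = rng.integers(0, k, size=n)
    y_mog = centres[labels] + rng.normal(size=(n, d))

    # III. Non-Gaussian (sparse) latent prior (ICA-style): x ~ Laplace, noise-free mixing
    x_sparse = rng.laplace(size=(n, k))
    y_ica = x_sparse @ W.T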
14 Compression with PCA
- Keep only the top-k principal directions, collected as the columns of U, and reconstruct the (centred) data as D ≈ U Uᵀ D
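A small illustrative sketch of this reconstruction using NumPy's SVD (my own example data, not code from the lecture):

    import numpy as np
    rng = np.random.default_rng(1)
    D = rng.normal(size=(50, 200))                 # features x samples; columns are data points
    D_centered = D - D.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(D_centered, full_matrices=False)
    k = 5
    Uk = U[:, :k]                                  # top-k principal directions
    D_hat = Uk @ (Uk.T @ D_centered)               # rank-k reconstruction: D ~ U U^T D
    err = np.linalg.norm(D_centered - D_hat) / np.linalg.norm(D_centered)
    print(f"relative reconstruction error with k={k}: {err:.3f}")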
15 Text Data Compression can capture synonymy
16 Term x Documents Matrix
17 This is called LSA: Latent Semantic Analysis
- Example query: "theory application"
- It is performed by SVD (Singular Value Decomposition), which is closely related to PCA
- Project the documents, E⁻¹ Uᵀ D, and the query words, E⁻¹ Vᵀ Dᵀ, into the same space
- A query is treated as a (small) document
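A minimal LSA sketch using scikit-learn's TruncatedSVD (an assumption on tooling; the toy corpus and the two-dimensional latent space are illustrative only):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["theory of learning", "application of learning machines",
            "galaxy spectra analysis", "stellar populations in galaxies"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)                     # document x term count matrix
    svd = TruncatedSVD(n_components=2, random_state=0)
    doc_vecs = svd.fit_transform(X)                 # documents in the latent (LSA) space
    query_vec = svd.transform(vec.transform(["theory application"]))  # query as a small document
    print(cosine_similarity(query_vec, doc_vecs))   # similarity of the query to each document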
18 Clustering
- Clustering of users' web browsing behaviours, from Internet Information Server logs for msnbc.com over one day's time
- 17 page categories: frontpage, news, tech, local, opinion, on-air, misc, weather, health, living, business, sports, summary, bbs, travel, msn-news, msn-sports
- Example data (each row is one user's sequence of visited page categories):
  1 1
  2
  3 2 2 4 2 2 2 3 3
  5
  6 7 7 7 6 6 8 8 8 8
  etc.
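One simple way to cluster such sequences, shown here only as a hedged sketch and not necessarily the method used in the talk: represent each user by their category-visit proportions and run k-means (scikit-learn assumed).

    import numpy as np
    from sklearn.cluster import KMeans

    sequences = [[1, 1], [2], [3, 2, 2, 4, 2, 2, 2, 3, 3], [5], [6, 7, 7, 7, 6, 6, 8, 8, 8, 8]]
    n_categories = 17
    counts = np.zeros((len(sequences), n_categories))
    for i, seq in enumerate(sequences):
        for page in seq:
            counts[i, page - 1] += 1                   # categories are numbered 1..17
    counts /= counts.sum(axis=1, keepdims=True)        # normalise to visit proportions
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(counts)
    print(labels)                                      # one cluster label per user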
19 (No transcript)
20 Non-Gaussian Latent Variable Models: ICA, solving inverse problems
- Blind source separation (the cocktail party problem)
- Image denoising
- Medical signal processing: fMRI, ECG, EEG
- Modelling of the visual cortex
- Feature extraction for face recognition
- Compression, redundancy reduction
- Clustering
- Time series analysis
21 The Cocktail Party Problem
Original (hidden) sources
22 Observations: linear mixtures of the sources
23 Recovered sources
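A minimal cocktail-party sketch with synthetic signals and FastICA from scikit-learn (my choice of algorithm and data; the slides do not specify which ICA implementation was used):

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)
    sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two hidden sources
    A = np.array([[1.0, 0.5], [0.4, 1.0]])                   # unknown mixing matrix
    observations = sources @ A.T                             # what the "microphones" record
    recovered = FastICA(n_components=2, random_state=0).fit_transform(observations)
    # 'recovered' matches the sources up to permutation, sign and scale.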
24 Overview
- Branches of machine learning
- Roadmap of unsupervised learning
- Further example applications
25 Galaxy spectra
- Elliptical galaxies
- The oldest galactic systems
- Believed to consist of a single population of old stars
- Recent theories indicate the presence of younger populations of stars
- What does the data tell us?
26 What does the data tell us?
- A. Kaban, L. Nolan and S. Raychaudhury. Finding Young Stellar Populations in Elliptical Galaxies from Independent Components of Optical Spectra. Proc. SIAM International Conference on Data Mining (SDM05), 2005.
- L.A. Nolan, M. Harva, A. Kaban and S. Raychaudhury. A data-driven Bayesian approach to finding young stellar populations in early-type galaxies from their UV-optical spectra. Mon. Not. of the Royal Astron. Soc. (MNRAS), 366(1), pp. 321-338.
- L.A. Nolan, S. Raychaudhury, A. Kaban. Young stellar populations in early-type galaxies in the Sloan Digital Sky Survey. Accepted to MNRAS.
27 (No transcript)
28 Finding communities from a dynamic network (Reference: A Kaban, X Wang, ECML06)
- An analogy: Deconvolutive Source Separation (aka the cocktail party problem)
- Microphone(s) record a mixture of signals
- Convolutive mixing due to echo
- The task is to recover the individual signals
- Studied in continuous signal processing
29 Computer-mediated discussion
- Convolution occurs due to various time-delay factors: network transmission bandwidth, differences in typing speed
- The activity is logged as a sequence of discrete events
- We try to model user participation dynamics based on this sequence
30 Dynamic Social Networks
Example of an activity log from an IRC chatroom
31 Results
Observed 1st order connectivity
Analysis by our Deconvolutive State Clustering
model
32 The clusters (communities) developing over time
Note: bursts of activity are being discovered
[Figure axes: clusters vs. time]
33 Scaling of our algorithm
34 Model-based inference of word saliency from text
Reference: X Wang and A Kaban, Proc. Discovery Science 2006.
Lowest-saliency words from the 20-Newsgroups corpus
35 Interpretation of a piece of text from talk.politics.mideast
Underlined: saliency > 0.8; normal font: saliencies between 0.4 and 0.8; grey: saliency < 0.4
36 Induced geometry of the topical and common word distributions
[Figure: colour indicates the estimated saliency]
37 Improved text classification
Data sets
Classification results
38 Modelling inhomogeneous documents
- Multinomial model components
- PLSA, LDA
- Independent Bernoulli model components
- Aspect Bernoulli
39 Reference: Blei et al., LDA, JMLR, 2003
40 What are the Bernoulli components?
Ref: A Kaban, E Bingham, Proc. SDM03; E Bingham, A Kaban, submitted to JAIR
Word presences normally tend to have topical causes. Word absences also have non-topical causes (a noise factor). Note: in terrorist messages the word presences might have non-topical causes too :-)
41 What does the non-topical factor tell us?
- Given a document, which are the word absences for which the posterior P(phantom | n, t, x_tn) is highest (amongst the posteriors of all other factors)?
- Equivalently, we can think of it as removing the noise factor.
42 (No transcript)
43 A visual example: explaining each pixel
- Example input data instances
- Components identified from the data (Beta posterior expectations)
- Explanation of each pixel value of each data instance, in terms of how likely it is to be explained by each of the components. Darker means higher probability. Note that the white pixels in the corners of a raster are explained by content-bearing components, whereas the occluded pixels come from a phantom component.
44 Predictive modelling of heterogeneous sequence collections
Reference: A. Kaban, Predictive Modelling of Heterogeneous Sequence Collections by Topographic Ordering of Histories. Machine Learning (accepted subject to minor revisions).
45 How to model heterogeneous behaviour?
- Shared behavioural patterns (analogous to procedures of computer programs)
- These are the basis of multiple relationships between users and groups of users
- Existing models are either global or assume homogeneous prototypical behaviour within groups
46 Example
- A set of first-order Markov chains combine to generate sequences, by interleaving in various proportions of participation (see the sketch after this list)
- Task:
- Estimate the shared generator chains
- Infer the proportions of participation
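A toy generative sketch of the picture above (NumPy assumed; the two generator chains and the participation proportions are invented for illustration, not estimated from data):

    import numpy as np
    rng = np.random.default_rng(0)

    # Two shared first-order generator chains over 3 symbols (each row sums to 1).
    chains = np.array([
        [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],   # chain 0: sticky
        [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],   # chain 1: cycling
    ])

    def generate(length, proportions):
        """Generate one sequence; at each step a chain is picked with the given proportions."""
        seq = [rng.integers(3)]
        for _ in range(length - 1):
            k = rng.choice(len(chains), p=proportions)          # which chain acts at this step
            seq.append(rng.choice(3, p=chains[k, seq[-1]]))     # next symbol from that chain
        return seq

    print(generate(15, proportions=[0.7, 0.3]))   # mostly driven by chain 0
    print(generate(15, proportions=[0.2, 0.8]))   # mostly driven by chain 1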
47 Prototypes vs. aspects in the model
It can be shown that the model estimation algorithm minimises a weighted sum of entropies of the parameters.
48 Robust to sample-size issues; outperforms the state of the art
49 A summary overview of the large sequence collection, in terms of lists of most probable sequences at equal locations of the map
50 Instead of conclusions
"All models are wrong but some are useful" (Cox)
- Data generated by natural processes typically contain redundancy
- One can throw away a lot of detail from the data and still keep the essential features
- Simple models can be successful
- Complicated models may need infeasible amounts of data to estimate
51 Old stuff
- The Latent Trait Model family as a general framework for data visualisation
- Local geometric properties of LTMs
- Hierarchical LTMs
- Topographic visualisation of state evolution
- Conclusions
[1] Kabán, A. and Girolami, M., A Combined Latent Class and Trait Model for the Analysis and Visualisation of Discrete Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8), pp. 859-872, 2001.
52 The Generative Latent Trait Model family
[Diagram: the latent space (space of the hidden structure) is mapped non-linearly to the observable multi-dimensional data space]
Aim: infer and visualise the structure of the data as much as possible
53 Text-based document representation and modelling
54 (No transcript)
55
- The Latent Trait Model family as a general framework for data visualisation
- Local magnification factors of the LT manifolds
- Hierarchical LTMs
- Topographic visualisation of state evolution
- Conclusions
56 Magnification factors and curvatures of the Bernoulli trait manifold on handwritten digits data
57
- Brief introduction to generative and latent variable models
- Text generation models
- The Latent Trait Model family as a general framework for data visualisation
- Local geometric properties of the LT manifolds
- Hierarchical LTMs
- Topographic visualisation of state evolution
- Conclusions
[2] Kabán, A., Tino, P. and Girolami, M., A General Framework for a Principled Hierarchic Visualisation of Multivariate Data. Proc. IDEAL02, Manchester, August 2002, to appear.
58 Hierarchical posterior mean mapping
59 Hierarchy of local magnification factors
60
- The Latent Trait Model family as a general framework for data visualisation
- Local geometric properties of LTMs
- Hierarchical LTMs
- Topographic visualisation of state evolution in temporally coherent data: visualising the topic evolution in coherent text streams
- Conclusions
[3] Kabán, A. and Girolami, M., A Dynamic Probabilistic Model to Visualize Topic Evolution in Text Streams. Journal of Intelligent Information Systems, special issue on Automated Text Categorization, Vol. 18, No. 2, March 2002.
61 (No transcript)
62 Chat-line discussions from an Internet Relay Chat room
- Posteriors in time during a discussion concentrated around a single topic (Susan Smith)
- Posteriors in a time frame containing a change of topic, from general politics to gun control
63 2-D visualisation of a chat session -- summary mapping of the posterior means
64 Future Challenges
- "The most important goal for theoretical computer science in 1950-2000 was to understand the von Neumann computer. The most important goal for theoretical computer science from 2000 onwards is to understand the Internet." - Christos H. Papadimitriou