Title: Machine Learning for Data Mining
1Machine Learning for Data Mining
- Ata Kaban
- The University of Birmingham
2Overview
- Branches of machine learning
- Roadmap of unsupervised learning
- Some example applications from my research work
3Machine learning
- Data y1, y2, ... (e.g. sensory inputs)
- Supervised learning
  - The machine is given desired outputs z1, z2, ... attached to the input data
  - Goal: learn to produce the correct output for a new input
- Reinforcement learning
  - The machine can also produce actions, which affect the state of the environment (the data), and receives rewards / punishments
  - Goal: learn to act in a way that maximises rewards in the long term
- Unsupervised learning
  - Goal: build a model of the data, which can be used for reasoning, explanation, decision making, prediction, etc., or to make other tasks easier
4Overview
- Branches of machine learning
- Roadmap of unsupervised learning
- Some example applications from my research work
5Unsupervised learning
- Goals: finding useful representations
  - Finding clusters
  - Dimensionality reduction
  - Building topographic maps
  - Finding hidden causes that explain the data
  - Modelling the data density
- Uses
  - Data compression
  - Outlier detection
  - Make other learning tasks easier
  - A theory of human learning and perception
6More examples of problems
- A more appropriate model should consider some conceptual dimensions instead of words (Gärdenfors)
- Finding topics, meanings, intentions
- The two-choice questionnaire problem
- Word saliency
7Latent variable models: a useful formalism
- Hidden (latent) variables
  - Topics
  - Intentions
  - Relative importance
- Observed variables
  - Stream of words (with grammar / syntax)
- Assuming that there is a systematic relationship between these two categories of variables, we try to find out the hidden variables of interest from the observed variables.
8- Modelling and inference
- Modelling: specify how the hidden variables of interest might have generated the observed variables
- Inference: invert this stochastic mapping, reasoning back from the observed variables to the hidden variables of interest
[Diagram: hidden causes --(stochastic mapping; modelling)--> data; data --(inference)--> hidden causes]
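To make the two directions concrete, here is a minimal Python sketch with invented numbers (an assumption, not taken from the slides): a binary hidden cause generates a one-dimensional observation through Gaussian noise, and Bayes' rule inverts the stochastic mapping to give a posterior over the hidden cause.

    import numpy as np

    prior = np.array([0.5, 0.5])          # p(z): prior over two hypothetical hidden causes
    means = np.array([-1.0, 2.0])         # modelling: cause z emits x ~ N(means[z], 1)

    x = 1.4                               # one observed data point
    likelihood = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)   # p(x | z)
    posterior = likelihood * prior
    posterior /= posterior.sum()          # inference: p(z | x) by Bayes' rule
    print(posterior)                      # how probable each hidden cause is, given x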
9- Probabilistic latent variable models
[Diagram: observed data = f(latent variable; parameters) + noise]
- Linear models with Gaussian latent prior: FA, PPCA, PCA
- Finite mixture models
- Linear models with non-Gaussian latent prior: IFA, ICA, PP
- Non-linear models: locally linear; explicitly non-linear (GTM)
10What are those acronyms?
- These are some classical and quite useful techniques for data analysis
- FA: Factor Analysis
- PCA: Principal Component Analysis
- PPCA: Probabilistic Principal Component Analysis
- MoG: Mixture of Gaussians
- IFA: Independent Factor Analysis
- ICA: Independent Component Analysis
- GTM: Generative Topographic Mapping
11Applications of latent variable models
- Tools for discovering latent structure in the observable data
- Data analysis and visualisation
- Application domains
  - Data mining
  - Telecommunications
  - Bioinformatics
  - Fraud detection
  - Information retrieval
  - Marketing analysis
12Three basic types of latent variable models: the intuition
13Linear latent variable models
- I. Models with Gaussian latent prior: p(x) = N(0, I)
  - Factor Analysis (FA): n ~ N(0, Ψ), Ψ diagonal
  - Probabilistic Principal Components Analysis (PPCA): n ~ N(0, σ²I); as σ² → 0 this gives PCA: n ~ δ(0)
- II. Mixture models: p(x) = Σ_k δ(x - x_k) p(x_k)
  - Mixture of Gaussians (MoG): n ~ N(0, Σ) or N(0, σ²I)
- III. Models with non-Gaussian (e.g. sparse) latent priors: p(x) non-Gaussian
  - Independent Factor Analysis (IFA): n ~ N(0, Σ), p(x) = Σ_k N_x(0, σ_k) p(k)
  - Independent Component Analysis (ICA): n ~ δ(0), p(x) non-Gaussian, e.g. Laplace
- Typical uses: compression, visualisation, clustering, signal separation, structure discovery (a generative sampling sketch follows below)
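As a hedged illustration of the shared generative form behind these linear models (all dimensions, values and variable names below are invented for the sketch): a Gaussian latent variable x is mapped linearly into data space and corrupted by noise, with a diagonal noise covariance giving FA and isotropic noise giving PPCA (and PCA in the zero-noise limit).

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, n = 10, 2, 500                  # observed dim, latent dim, sample size (assumed)
    W = rng.normal(size=(d, k))           # linear mixing / factor loading matrix
    x = rng.normal(size=(n, k))           # Gaussian latent prior p(x) = N(0, I)

    psi = rng.uniform(0.1, 1.0, size=d)   # diagonal noise variances (FA)
    y_fa = x @ W.T + rng.normal(size=(n, d)) * np.sqrt(psi)     # Factor Analysis data

    sigma = 0.3                           # isotropic noise std (PPCA)
    y_ppca = x @ W.T + rng.normal(size=(n, d)) * sigma          # PPCA; sigma -> 0 gives PCA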
14Compression with PCA: D ≈ U U^T D (reconstruct the data matrix D from its projection onto the leading principal directions U)
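A minimal sketch of this compression step on random placeholder data (the sizes and the choice of k are assumptions): keep the leading k left singular vectors U of the centred data matrix D, store the k-dimensional projections, and reconstruct D ≈ U U^T D.

    import numpy as np

    rng = np.random.default_rng(1)
    D = rng.normal(size=(100, 500))          # 100-dimensional data, 500 points as columns
    D = D - D.mean(axis=1, keepdims=True)    # centre each dimension

    k = 10
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    Uk = U[:, :k]                            # top-k principal directions
    D_compressed = Uk.T @ D                  # k numbers per point instead of 100
    D_reconstructed = Uk @ D_compressed      # D ~= U U^T D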
15Text Data Compression can capture synonymy
16 Term x Documents Matrix
17This is called LSA: Latent Semantic Analysis
Example query: "theory application"
It is performed by SVD (Singular Value Decomposition), which is closely related to PCA.
With the SVD D = U E V^T of the term x documents matrix, project documents as E^-1 U^T D and (query) words as E^-1 V^T D^T, into the same space.
A query is treated as a (small) document.
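A hedged sketch on a tiny invented term x document matrix: truncate the SVD, place the documents in the latent space, fold a query in as a small pseudo-document, and rank documents by cosine similarity. The vocabulary, counts and query are made up for illustration.

    import numpy as np

    D = np.array([[2, 0, 1, 0],      # term "theory"
                  [1, 0, 2, 0],      # term "model"
                  [0, 3, 0, 1],      # term "application"
                  [0, 1, 0, 2]],     # term "system"
                 dtype=float)        # columns = documents

    k = 2
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    Uk, Ek = U[:, :k], np.diag(s[:k])

    doc_coords = np.linalg.inv(Ek) @ Uk.T @ D     # documents in the latent space
    q = np.array([1.0, 0, 1.0, 0])                # query "theory application" as a mini-document
    q_coords = np.linalg.inv(Ek) @ Uk.T @ q       # query folded into the same space

    sims = (doc_coords.T @ q_coords) / (
        np.linalg.norm(doc_coords, axis=0) * np.linalg.norm(q_coords) + 1e-12)
    print(sims)                                   # cosine similarity of each document to the query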
18Clustering
- Clustering of users' web browsing behaviours, from Internet Information Server logs for msnbc.com over one day's time
- 17 page categories: frontpage, news, tech, local, opinion, on-air, misc, weather, health, living, business, sports, summary, bbs, travel, msn-news, msn-sports
- Example data (each row is one user's session, as a sequence of visited page categories; a toy clustering sketch follows below):
  - 1 1
  - 2
  - 3 2 2 4 2 2 2 3 3
  - 5
  - 6 7 7 7 6 6 8 8 8 8
  - etc.
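As a toy illustration only (this is a simple baseline, not necessarily the model used in this work): turn each session into a 17-dimensional visit-count vector over the page categories and cluster the vectors, e.g. with k-means.

    import numpy as np
    from sklearn.cluster import KMeans

    n_categories = 17
    sessions = [[1, 1], [2], [3, 2, 2, 4, 2, 2, 2, 3, 3], [5], [6, 7, 7, 7, 6, 6, 8, 8, 8, 8]]

    X = np.zeros((len(sessions), n_categories))
    for i, s in enumerate(sessions):
        for page in s:
            X[i, page - 1] += 1          # categories are numbered 1..17 on the slide

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)                        # cluster assignment of each session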
20Non-Gaussian latent variable models: ICA for solving inverse problems
- Blind source separation (the cocktail party problem)
- Image denoising
- Medical signal processing: fMRI, ECG, EEG
- Modelling of the visual cortex
- Feature extraction for face recognition
- Compression, redundancy reduction
- Clustering
- Time series analysis
21The Cocktail Party Problem
Original (hidden) sources
22Observations: linear mixtures of the sources
23Recovered sources
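The cocktail-party demo itself is shown as figures; the following hedged sketch reproduces the idea on synthetic signals (the signals and mixing matrix are invented), using FastICA from scikit-learn to recover linearly mixed sources up to permutation and scale.

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * t)                           # hidden source 1
    s2 = np.sign(np.sin(3 * t))                  # hidden source 2 (square wave)
    S = np.c_[s1, s2]

    A = np.array([[1.0, 0.5],                    # unknown mixing matrix
                  [0.4, 1.0]])
    X = S @ A.T                                  # observed "microphone" mixtures

    S_est = FastICA(n_components=2, random_state=0).fit_transform(X)   # recovered sources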
24Overview
- Branches of machine learning
- Roadmap of unsupervised learning
- Further example applications
25Galaxy spectra
- Elliptical galaxies
  - oldest galactic systems
  - believed to consist of a single population of old stars
  - recent theories indicate the presence of younger populations of stars
  - what does the data tell us?
26 What does the data tell us?
- A. Kaban, L. Nolan and S. Raychaudhury. Finding Young Stellar Populations in Elliptical Galaxies from Independent Components of Optical Spectra. Proc. SIAM International Conference on Data Mining (SDM05), 2005.
- L.A. Nolan, M. Harva, A. Kaban and S. Raychaudhury. A data-driven Bayesian approach to finding young stellar populations in early-type galaxies from their UV-optical spectra. Mon. Not. of the Royal Astron. Soc. (MNRAS), 366(1), pp. 321-338.
- L.A. Nolan, S. Raychaudhury and A. Kaban. Young stellar populations in early-type galaxies in the Sloan Digital Sky Survey. Accepted to MNRAS.
28Reference: A. Kaban and X. Wang, ECML 2006
Finding communities from a dynamic network
- An analogy: deconvolutive source separation (aka the cocktail party problem)
  - Microphone(s) record a mixture of signals
  - Convolutive mixing due to echo
  - The task is to recover the individual signals
  - Studied in continuous signal processing
29Computer-mediated discussion
- Convolution occurs due to various time-delay factors: network transmission bandwidth, differences in speed of typing
- The activity is logged as a sequence of discrete events
- We try to model user participation dynamics based on this sequence
30Dynamic Social Networks
Example of activity log from an IRC Chatroom
31Results
Observed 1st order connectivity
Analysis by our Deconvolutive State Clustering model
32The clusters (communities) developing over time
Note: bursts of activity are being discovered
[Figure: clusters vs. time]
33Scaling of our algorithm
34Reference: X. Wang and A. Kaban, Proc. Discovery Science 2006
Model-based inference of word saliency from text
Lowest-saliency words from the 20-Newsgroups corpus
35Interpretation of a piece of text from talk.politics.mideast
Underlined: saliency > 0.8; normal font: saliency between 0.4 and 0.8; grey: saliency < 0.4
36Induced geometry of the topical and common word distributions
[Figure: colour shows the estimated saliency]
37Improved text classification
Data sets
Classification results
38Modelling inhomogeneous documents
- Multinomial model components
- PLSA, LDA
- Independent Bernoulli model components
- Aspect Bernoulli
39Reference: Blei et al., LDA, JMLR 2003
40What are the Bernoulli components?
References: A. Kaban and E. Bingham, Proc. SDM03; E. Bingham and A. Kaban, submitted to JAIR.
Word presences normally tend to have topical causes. Word absences also have non-topical causes (a noise factor). Note: in terrorist messages the word presences might have non-topical causes too :-)
41What does the non-topical factor tell us?
- Given a document, which are the word absences for which the posterior P(phantom | n, t, x_tn) is highest (amongst the posteriors of all the other factors)?
- Equivalently, we can think of it as removing the noise factor.
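A very rough sketch of this posterior query, under an Aspect-Bernoulli-style mixture with invented numbers: the responsibility of a component for an observed word absence is proportional to the document's mixing weight for that component times one minus the component's Bernoulli parameter for that word. Here "component 0" plays the phantom / noise role.

    import numpy as np

    p = np.array([[0.05, 0.10, 0.05, 0.10],     # phantom component: low presence probability everywhere
                  [0.90, 0.80, 0.05, 0.10],     # topical component A
                  [0.05, 0.10, 0.85, 0.90]])    # topical component B
    theta = np.array([0.2, 0.5, 0.3])           # mixing weights for one document (assumed)

    x = np.array([1, 0, 1, 0])                  # observed presences / absences of 4 words
    lik = np.where(x == 1, p, 1 - p)            # per-component likelihood of each observation
    post = theta[:, None] * lik
    post /= post.sum(axis=0, keepdims=True)     # P(component | word, document)

    absences = np.where(x == 0)[0]
    print(post[0, absences])                    # how strongly the phantom factor explains each absence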
43A visual example: explaining each pixel
Example input data instances
Components identified from the data (Beta posterior expectations)
Explanation of each pixel value of each data instance, in terms of how likely it is to be explained by each of the components. Darker means higher probability. Note that the white pixels in the corners of a raster are explained by content-bearing components, whereas the occluded pixels come from a phantom component.
44Predictive modelling of heterogeneous sequence collections
Reference: A. Kaban. Predictive Modelling of Heterogeneous Sequence Collections by Topographic Ordering of Histories. Machine Learning (accepted subject to minor revisions).
45- How to model heterogeneous behaviour?
- Shared behavioural patterns (analogous to procedures of computer programs)
- These are the basis of multiple relationships between users and groups of users
- Existing models are either global or assume homogeneous prototypical behaviour within groups
46Example
- A set of 1st-order Markov chains combine to generate sequences by interleaving, in various proportions of participation (see the sketch after this list)
- Task
  - Estimate the shared generator chains
  - Infer the proportions of participation
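A hedged sketch of this generative picture (the symbols, transition matrices and proportions are all invented): a few shared first-order chains are interleaved step by step according to per-sequence participation proportions. Estimating the chains and proportions from data is the task of the model; only the generator is sketched here.

    import numpy as np

    rng = np.random.default_rng(3)
    n_symbols = 4

    T = np.array([                                # two shared generator chains (rows sum to 1)
        [[0.80, 0.10, 0.05, 0.05], [0.70, 0.20, 0.05, 0.05],
         [0.60, 0.20, 0.10, 0.10], [0.50, 0.30, 0.10, 0.10]],
        [[0.05, 0.05, 0.10, 0.80], [0.05, 0.05, 0.20, 0.70],
         [0.10, 0.10, 0.20, 0.60], [0.10, 0.10, 0.30, 0.50]],
    ])

    def generate(pi, length=30):
        """Interleave the shared chains with participation proportions pi."""
        seq = [int(rng.integers(n_symbols))]
        for _ in range(length - 1):
            k = rng.choice(len(T), p=pi)                      # which chain emits this step
            seq.append(int(rng.choice(n_symbols, p=T[k, seq[-1]])))
        return seq

    print(generate(pi=np.array([0.7, 0.3])))                  # one sequence, drawn mostly from chain 0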
47Prototypes vs aspects in the model
It can be shown that the model estimation algorithm minimises a weighted sum of entropies of the parameters.
48Robust to sample-size issues; outperforms the state of the art
49A summary overview of the large sequence
collection in terms of lists of most probable
sequences at equal locations of the map
50Instead of conclusions
"All models are wrong, but some are useful" (Box)
- Data generated by natural processes typically contain redundancy
- One can throw away a lot of detail from the data and still keep the essential features
- Simple models can be successful
- Complicated models may need infeasible amounts of data to estimate
51Old stuff
- The Latent Trait Model family as a general framework for data visualisation
- Local geometric properties of LTMs
- Hierarchical LTMs
- Topographic visualisation of state evolution
- Conclusions
[1] Kabán, A. and Girolami, M., A Combined Latent Class and Trait Model for the Analysis and Visualisation of Discrete Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8), pp. 859-872, 2001.
52The Generative Latent Trait Model Family
[Diagram: latent space (space of the hidden structure) mapped non-linearly into the observable multi-dimensional data space]
Aim: infer and visualise the structure of the data as much as possible
53Text-based document representation modelling
55- The Latent Trait Model family as a general framework for data visualisation
- Local Magnification Factors of the LT manifolds
- Hierarchical LTMs
- Topographic visualisation of state evolution
- Conclusions
56Magnification factors and curvatures of the
Bernoulli Trait manifold on handwritten digits
data
57- Brief introduction to generative and latent variable models
- Text generation models
- The Latent Trait Model family as a general framework for data visualisation
- Local geometric properties of the LT manifolds
- Hierarchical LTMs
- Topographic visualisation of state evolution
- Conclusions
[2] Kabán, A., Tino, P. and Girolami, M., A General Framework for a Principled Hierarchic Visualisation of Multivariate Data, Proc. IDEAL02, Manchester, August 2002, to appear.
58Hierarchical Posterior Mean Mapping
59Hierarchy of Local Magnification Factors
60- The Latent Trait Model family as a general framework for data visualisation
- Local geometric properties of LTMs
- Hierarchical LTMs
- Topographic visualisation of state evolution in temporally coherent data: visualising the topic evolution in coherent text streams
- Conclusions
[3] Kabán, A. and Girolami, M., A Dynamic Probabilistic Model to Visualize Topic Evolution in Text Streams, Journal of Intelligent Information Systems, special issue on Automated Text Categorization, Vol. 18, No. 2, March 2002.
62Chat line discussions from an Internet Relay Chat room
Posteriors in time during a discussion concentrated around a single topic (Susan Smith)
Posteriors in a time frame containing a change of topic, from general politics to gun control
63Two-dimensional visualisation of a chat session -- summary mapping of the posterior means
64Future Challenges
- "The most important goal for theoretical computer science in 1950-2000 was to understand the von Neumann computer. The most important goal for theoretical computer science from 2000 onwards is to understand the Internet." - Christos H. Papadimitriou