Title: Exploring similarities and dissimilarities of objects
1. Exploring similarities and dissimilarities of objects
- Distance-preserving 2-dimensional scatter charts
  - MDS scatter charts
- Variation-preserving 2-dimensional score charts
  - PCA score charts
- Cluster analysis
  - Unconstrained or constrained K-means clustering
2. Multidimensional scaling
- Primary objective
  - Fit the original data into a low-dimensional space so that the distortion is minimized
- Problem formulation
  - For a given set of N items, find a representation in few dimensions such that the N(N-1)/2 interitem similarities (distances) nearly match the original similarities (distances).
- Terminology
  - The resulting low-dimensional plot is called an ordination of the data
  - The numerical measure of closeness of the low-dimensional representation is called stress
3. Metric and nonmetric multidimensional scaling
- Metric multidimensional scaling (principal coordinate analysis)
  - Multidimensional scaling based on the actual magnitudes of the original similarities
- Nonmetric multidimensional scaling
  - Multidimensional scaling based on the rank orders of the N(N-1)/2 original similarities
4. Metric multidimensional scaling: a simple example
- Geometrical representation of cities produced by MDS and an airline distance table
5. Metric multidimensional scaling: a simple example of SAS code
- Geometrical representation of cities produced by MDS and an airline distance table
/* Fit a 2-dimensional configuration; oconfig writes the coordinates to the out= data set */
proc mds data=mds.uscities dimension=2
         out=mds.uzcitiesout oconfig;
run;
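For readers without SAS, the idea behind metric MDS (principal coordinate analysis) can be sketched in plain Python. This is only an illustration under stated assumptions: it is not what proc mds does internally, the function name classical_mds is hypothetical, the eigenvectors are found by simple power iteration, and the input is a made-up rectangle of four points rather than the airline distance table.

```python
import math
import random


def classical_mds(dist, q=2, iters=500, seed=0):
    """Metric MDS (principal coordinates) of a symmetric distance matrix.

    Assumes the distances are (close to) Euclidean, so that the leading
    eigenvalues of the double-centred matrix are positive.
    """
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][k] ** 2 for k in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    b = [[-0.5 * (sq[i][k] - row[i] - row[k] + total) for k in range(n)]
         for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * q for _ in range(n)]
    for axis in range(q):
        # Power iteration for the dominant eigenpair of the (deflated) matrix.
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][k] * v[k] for k in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient of the unit vector v gives the eigenvalue.
        lam = sum(v[i] * sum(b[i][k] * v[k] for k in range(n))
                  for i in range(n))
        if lam <= 0:
            break  # no further positive eigenvalue; remaining axes stay zero
        for i in range(n):
            coords[i][axis] = math.sqrt(lam) * v[i]
        # Deflate so that the next pass finds the next eigenpair.
        for i in range(n):
            for k in range(n):
                b[i][k] -= lam * v[i] * v[k]
    return coords


# Hypothetical input: the corners of a 3-by-4 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
table = [[math.dist(p, r) for r in points] for p in points]
config = classical_mds(table, q=2)
```

The recovered configuration is determined only up to rotation and reflection, so it should be judged by its interpoint distances, not by the raw coordinates.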
6. Metric multidimensional scaling: a simple example
- Geometrical representation of cities produced by MDS and an airline distance table
[Figure: MDS configuration of the cities; labelled points include Spokane, Boston, and Los Angeles]
7. Metric and nonmetric multidimensional scaling: general problem formulation
- Consider a set of N items and an ordering of the M = N(N-1)/2 interitem similarities:
  $s_{i_1 k_1} < s_{i_2 k_2} < \dots < s_{i_M k_M}$
- We want to find a q-dimensional configuration such that the interitem distances $d^{(q)}_{ik}$ match this ordering. A perfect match occurs when
  $d^{(q)}_{i_1 k_1} > d^{(q)}_{i_2 k_2} > \dots > d^{(q)}_{i_M k_M}$
- As long as the order is preserved, the magnitudes of the distances are considered to be irrelevant.
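The ordering condition can be checked mechanically. The following Python sketch assumes symmetric similarity and distance matrices without ties; the function name is_perfect_match and the three-item example are hypothetical.

```python
from itertools import combinations


def is_perfect_match(similarity, distance):
    """True if distances fall strictly as similarities rise, over all pairs i < k."""
    n = len(similarity)
    # Arrange the N(N-1)/2 pairs in ascending order of similarity.
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: similarity[p[0]][p[1]])
    ordered = [distance[i][k] for i, k in pairs]
    # A perfect match requires strictly descending distances in that order.
    return all(a > b for a, b in zip(ordered, ordered[1:]))


# Hypothetical similarities and two candidate sets of distances.
s = [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.5],
     [0.2, 0.5, 1.0]]
d_good = [[0, 1, 5], [1, 0, 3], [5, 3, 0]]
d_bad = [[0, 1, 5], [1, 0, 6], [5, 6, 0]]
d_squared = [[v * v for v in row] for row in d_good]

print(is_perfect_match(s, d_good))     # order preserved
print(is_perfect_match(s, d_squared))  # magnitudes changed, order preserved
print(is_perfect_match(s, d_bad))      # order violated
```

Squaring the distances changes their magnitudes but not their order, so the match is unaffected.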
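8. Minimization of stress
- We would like to find a q-dimensional representation such that
  $\mathrm{Stress}(q) = \left\{ \frac{\sum_{i<k} \left( d^{(q)}_{ik} - \hat{d}^{(q)}_{ik} \right)^2}{\sum_{i<k} \left[ d^{(q)}_{ik} \right]^2} \right\}^{1/2}$
  is minimized, where the $\hat{d}^{(q)}_{ik}$ are reference numbers that are monotonically related to the observed similarities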
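A minimal Python sketch of a stress calculation for a fixed configuration follows. It rests on assumptions not stated on the slide: the input is given as dissimilarities (for similarities, sort in descending order instead), and the reference numbers are obtained by a least-squares monotone fit (pool-adjacent-violators), which is one common choice. The function names are hypothetical.

```python
import math


def monotone_fit(values):
    """Least-squares non-decreasing fit (pool-adjacent-violators)."""
    blocks = []  # each block holds [mean, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the non-decreasing order is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    fitted = []
    for mean, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def stress(dissimilarities, distances):
    """Stress of fitted distances; both sequences list the pairs in the same order."""
    order = sorted(range(len(distances)), key=lambda p: dissimilarities[p])
    d_sorted = [distances[p] for p in order]
    d_hat = monotone_fit(d_sorted)  # reference numbers, monotone in the data
    num = sum((d - h) ** 2 for d, h in zip(d_sorted, d_hat))
    den = sum(d ** 2 for d in d_sorted)
    return math.sqrt(num / den)


# Hypothetical example with four pairs; the second and third are out of order.
value = stress([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0])
```

When the fitted distances are already in the same rank order as the dissimilarities, the monotone fit reproduces them and the stress is zero.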
9. Interpretation of stress levels
- The stress is always between 0 and 1. As q increases, the minimum stress will decrease, and it will be zero for q = N-1.
- Any stress value less than 0.1 is typically taken to mean that the representation is good.
10. Multidimensional scaling: software options
- SAS: proc distance followed by proc mds
- GGobi
11. Identifying outliers and anomalies
- Simple filtering of raw data
- Analysis of residuals derived from prediction models
- Principal components score charts
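The residual-based approach can be sketched in Python as follows. The prediction model here is an ordinary least-squares line with one predictor, and the cut-off of three residual standard deviations is an assumed convention, not a rule from these slides; the function name and the data are hypothetical.

```python
import math


def flag_residual_outliers(x, y, threshold=3.0):
    """Indices whose residual from a fitted straight line is unusually large."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    scale = math.sqrt(sum(r * r for r in resid) / (n - 2))
    if scale == 0.0:
        return []  # perfect fit, nothing to flag
    return [i for i, r in enumerate(resid) if abs(r) > threshold * scale]


# Hypothetical data: an exact line with one anomalous observation at index 12.
x = list(range(20))
y = [2.0 + 0.5 * a for a in x]
y[12] += 10.0
flagged = flag_residual_outliers(x, y)
```

A single large anomaly also inflates the residual standard deviation, so in short series a robust scale estimate may be preferable.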
12. Detection of anomalies in surface ozone concentrations recorded at Ähtäri, Finland, at 1200
The ozone concentrations are correlated with several meteorological variables
13. PLS-normalised and seasonally adjusted concentrations of surface ozone at Ähtäri, Finland (PLS = Partial Least Squares regression)
- PLS-normalisation with respect to contemporaneous data, measured at a network of stations, regarding
  - temperature
  - humidity
  - wind direction
  - wind speed
- Separate PLS-normalisations for each month
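One way to read "normalisation" is that the part of the response explained by the covariates is subtracted, leaving the mean level plus the unexplained variation. The Python sketch below does this with a single-component PLS1 fit on centred data. It is an assumption-laden illustration: the number of components, the absence of column scaling, and the function name pls1_normalise are all choices made here, not details taken from the slides.

```python
import math


def pls1_normalise(X, y):
    """Remove the covariate-driven part of y using a one-component PLS1 fit."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction in covariate space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        return list(y)  # covariates carry no linear information about y
    w = [v / norm for v in w]
    # Scores along that direction and the regression of y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    # Subtract the fitted covariate effect; the mean level of y is kept.
    return [y[i] - q * t[i] for i in range(n)]


# Hypothetical example: the response depends exactly on one covariate.
temperature = [1.0, 2.0, 3.0, 4.0]
ozone = [10.0 + 2.0 * v for v in temperature]
normalised = pls1_normalise([[v] for v in temperature], ozone)
```

To mimic separate normalisations for each month, the function would be called once per calendar month on that month's rows only.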
14. Modelling ln(daily electricity consumption) as a spline function of the population-weighted mean temperature in Sweden: residual analysis
15. From multiple time series of data to a smooth trend surface
16. Smoothing of the trend function in models of time series data representing several sites along a gradient
- Spatial smoothing along a gradient
- Temporal smoothing across years
17. Smoothing of the trend function in models of time series data representing several seasons
- Sequential smoothing across seasons
- Temporal smoothing across years
18. Smoothing of the trend function in models of time series data representing several sectors
- Circular smoothing across sectors
- Temporal smoothing across years
19. Remark
- Different types of time series data may require different types of smoothing
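The three cases differ in which cells of the year-by-coordinate grid count as neighbours when smoothing across coordinates. The Python sketch below encodes one plausible reading of the terms gradient, sequential, and circular; the function name neighbours and the exact neighbour definitions are assumptions made for illustration.

```python
def neighbours(i, j, n_years, m, kind):
    """Cells adjacent to (year i, coordinate j) across the m coordinates.

    kind = "gradient":   sites along a line, no wrap-around at either end
    kind = "sequential": seasons, the last season of one year is followed
                         by the first season of the next year
    kind = "circular":   sectors, the last sector is adjacent to the first
                         within the same year
    """
    if kind == "gradient":
        cells = [(i, j - 1), (i, j + 1)]
    elif kind == "circular":
        cells = [(i, (j - 1) % m), (i, (j + 1) % m)]
    elif kind == "sequential":
        prev = (i, j - 1) if j > 0 else (i - 1, m - 1)
        nxt = (i, j + 1) if j < m - 1 else (i + 1, 0)
        cells = [prev, nxt]
    else:
        raise ValueError("unknown kind: " + str(kind))
    # Drop cells that fall outside the observed grid.
    return [(a, b) for a, b in cells if 0 <= a < n_years and 0 <= b < m]


# Hypothetical grid with 3 years and 4 coordinates.
print(neighbours(0, 0, 3, 4, "gradient"))
print(neighbours(1, 0, 3, 4, "sequential"))
print(neighbours(0, 0, 3, 4, "circular"))
```

Temporal smoothing across years would use the same neighbours, (i - 1, j) and (i + 1, j), in all three cases.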
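20. A simple model for simultaneous smoothing and adjustment for a single covariate
- Let $y_{ij}$ be the observed response for the jth coordinate in the ith year,
- and let $x_{ij}$ denote a contemporaneous value of a covariate. Assume that
  $y_{ij} = \alpha_{ij} + \beta_j x_{ij} + \varepsilon_{ij}$
  where $y_{ij}$ is the response, $\alpha_{ij}$ the deterministic trend, $\beta_j x_{ij}$ the impact of the covariate, and $\varepsilon_{ij}$ the random error.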
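21. A semiparametric model for simultaneous smoothing and adjustment for several covariates
- Let $y_{ij}$ be the observed response for the jth class in the ith year,
- and let $x^{(1)}_{ij}, \dots, x^{(p)}_{ij}$ denote contemporaneous values of covariates.
- Assume that
  $y_{ij} = \alpha_{ij} + \beta^{(1)}_j x^{(1)}_{ij} + \dots + \beta^{(p)}_j x^{(p)}_{ij} + \varepsilon_{ij}$
  where $y_{ij}$ is the response, $\alpha_{ij}$ the deterministic trend, each $\beta^{(k)}_j x^{(k)}_{ij}$ the impact of a covariate, and $\varepsilon_{ij}$ the random error.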
22. Gradient smoothing
- Penalty of irregular interannual variation
- Penalty of irregular variation along the gradient
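The slide's formula is not recoverable from the text, so the following Python sketch only illustrates the general idea of a penalised least-squares criterion with two roughness penalties. Everything specific is an assumption: the penalties are taken to be sums of squared second differences, the covariate slope is allowed to differ between sites, and the names (objective, lam_time, lam_grad) are hypothetical.

```python
def second_diff_penalty(series):
    """Sum of squared deviations of each interior value from its neighbours' mean."""
    return sum((series[t] - (series[t - 1] + series[t + 1]) / 2) ** 2
               for t in range(1, len(series) - 1))


def objective(y, x, alpha, beta, lam_time, lam_grad):
    """Residual sum of squares plus weighted roughness penalties on the trend.

    y, x, alpha are n-by-m grids (years by sites); beta has one slope per site.
    """
    n, m = len(y), len(y[0])
    rss = sum((y[i][j] - alpha[i][j] - beta[j] * x[i][j]) ** 2
              for i in range(n) for j in range(m))
    # Irregular interannual variation: roughness of each site's trend over years.
    time_pen = sum(second_diff_penalty([alpha[i][j] for i in range(n)])
                   for j in range(m))
    # Irregular variation along the gradient: roughness across sites within a year.
    grad_pen = sum(second_diff_penalty(alpha[i]) for i in range(n))
    return rss + lam_time * time_pen + lam_grad * grad_pen


# Hypothetical 3-by-3 example with a single bump in the trend surface.
bump = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
zeros = [[0.0] * 3 for _ in range(3)]
value = objective(bump, zeros, bump, [0.0, 0.0, 0.0], lam_time=2.0, lam_grad=3.0)
```

A trend surface that is linear in both directions incurs no penalty, so larger penalty weights pull the estimated surface towards a plane.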
23. Smoothing of the trend function in models of time series data representing several sites along a gradient
- Spatial smoothing along a gradient
- Temporal smoothing across years
24. Smoothing of the trend function in models of time series data representing several seasons
- Sequential smoothing across seasons
- Temporal smoothing across years
25. Smoothing of the trend function in models of time series data representing several sectors
- Circular smoothing across sectors
- Temporal smoothing across years
26. Satellite image of blue-green algae (cyanobacteria) in the Baltic Sea, summer 2005
[Figure: satellite image with labels Finland, Sweden, Estonia, Latvia, Baltic Sea, and Algae bloom]
27. Sampling sites for water quality in the Stockholm archipelago
[Figure: map with labels Baltic Sea and Stockholm]
28. Secchi depth (water clarity) in the inner Stockholm archipelago
Investments in improved nitrogen removal
29. Secchi depth (water clarity) at three stations in the inner Stockholm archipelago
Water clarity varies strongly within sites
30. Secchi depth (water clarity) at three stations in the inner Stockholm archipelago
Water clarity varies with water temperature
31. Trend surface for salinity- and temperature-normalized Secchi depth data along a salinity gradient in the inner Stockholm archipelago