Title: Living under the Curse of Dimensionality
1. Living under the Curse of Dimensionality
2. Roadmap
- Why spend time on high-dimensional data?
- Non-trivial
- Some explanations
- A different approach
- Which leads to ...
3. The Subtext: data engineering
- Solution techniques are compositions of algorithms for fundamental operations
- Algorithms assume certain contexts
- There can be gains of orders of magnitude in using good algorithms that are suited to the context
- Sometimes better algorithms need to be built for new contexts.
4. COTS Database technology
- Simple data is handled well, even for very large databases and high transaction volumes, by relational databases
- Geospatial data (2d, 2.5d, 3d) is handled reasonably well
- But pictures, series, sequences, etc. are poorly supported.
5. For example
- Find the 10 days for which trading on the LSX was most similar to today's, and the pattern for the following day
- Find the 20 sequences from SwissProt that are most similar to this one
- If I hum the first few bars, can you fetch the song from the music archive?
6. Dimensionality?
- It's all in the modelling
- k-d means that the important relationships and operations on these objects involve a certain set of k attributes as a bloc
- 1d: a list; key properties flow from the value of a single attribute (position in the list)
- 2d: points on a plane; key properties and relationships follow from position on the plane
- 3d and 4d ...
7. All in the modelling
- Take a set of galaxies
- Some physical interactions deal with galaxies as points in 3d (spatial) space
- Or analyses based on the colours of galaxies could consider them as points in (say) 5d (colour) space
8. All in the modelling (>5d)
- Complex data types (pictures, graphs, etc.) can be modelled as k-d points using well-known tricks
- A blinking star could be modelled by the histogram of its brightness
- A photo could be represented as a histogram of brightness x colour (3x3) of its pixels, i.e. as a point in 9d space (a minimal sketch follows this list)
- A sonar echo could be modelled by the intensity every 10 ms after the first return.
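To make the "9d point" trick concrete, here is a minimal Python sketch; the function name, the use of a single colour channel, and the toy random image are illustrative assumptions rather than the talk's own pipeline.

```python
import numpy as np

def photo_to_9d_point(brightness, colour, bins=3):
    """Model a photo as a point in 9d space: a joint 3x3 histogram of
    per-pixel brightness and colour, flattened and normalised.
    Inputs are per-pixel values scaled to [0, 1]."""
    hist, _, _ = np.histogram2d(brightness, colour,
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist.ravel() / hist.sum()   # 3x3 grid -> normalised 9d vector

# Toy usage with a random 100x100 "photo"
rng = np.random.default_rng(0)
print(photo_to_9d_point(rng.random(100 * 100), rng.random(100 * 100)))
```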
9. Access Methods
- Access methods structure a data set for efficient search
- The standard components of a method are:
  - Reduction of the data set to a set of sub-sets (partitions)
  - Definition of a directory (index) of partitions to allow traversal
  - Definition of a search algorithm that traverses intelligently.
10. Only a few variants on the theme
- Space-based
  - Cells derived by a regular decomposition of the data space, s.t. cells have nice properties
  - Points assigned to cells
- Data-based
  - Decomposition of the data set into sub-sets, s.t. the sub-sets have nice properties
  - Incremental or bulk load.
- Efficiency comes through pruning: the index supports discovery of the partitions that need not be accessed.
11. k-d: an extension of 2d?
- Extensive R&D on (geo)spatial databases, 1985-1995
- Surely k-d is just a generalisation of the problems in 2d and 3d?
- Analogues of 2d methods ran out of puff at about 8d, sometimes earlier
- Why was this? Did it matter?
12. The Curse of Dimensionality
- Named by Bellman (1961)
- Creep in applicability, so that it now generally includes the non-commonsense effects that become increasingly awkward as the dimensionality rises
- And the non-linearity of costs with dimensionality (often exponential)
- Two examples.
13. CofD Example 1
- Sample the space [0,1]^d by a grid with a spacing of 0.1 (see the sketch after this list)
  - 1d: 10 points
  - 2d: 100 points
  - 3d: 1,000 points
  - ...
  - 10d: 10,000,000,000 points
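The growth is simply (1/spacing)^d; a two-line Python sketch (purely illustrative) makes the explosion concrete:

```python
# Grid points needed to sample [0,1]^d at spacing 0.1: (1 / 0.1)^d = 10^d
for d in (1, 2, 3, 10, 20):
    print(f"{d:2}d: {10 ** d:,} points")
```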
14. CofD Example 2
- Determine the mean number of points within a hypersphere of radius r, placed randomly within the unit hypercube with a point density of a. Let's assume r << 1.
- Trivial if we ignore edge effects
- But that would be misleading
15. Edge effects?
P(edge effect) = 1 - (1 - 2r)^d
             = 2r                  (1d)
             = 4r - 4r^2           (2d)
             = 6r - 12r^2 + 8r^3   (3d)
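A quick numerical check of the expansion above (Python, illustrative): even for small r, the probability of an edge effect climbs rapidly towards 1 as d rises.

```python
# Probability that a point drawn uniformly from the unit hypercube lies
# within r of at least one face, i.e. 1 - (1 - 2r)^d.
def p_edge_effect(r, d):
    return 1.0 - (1.0 - 2.0 * r) ** d

for d in (1, 2, 3, 10, 100):
    print(f"{d:3}d: P(edge effect) = {p_edge_effect(0.1, d):.3f}")
```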
16. Which means ...
- If it's a uniform random distribution, a point is likely to be near a face (or edge) in high-dimensional space
- Analyses quickly end up in intractable expressions
- Usually, interesting behaviour is lost when models are simplified to permit neat analyses.
17. Early rumbles
- Weber et al. (1998): assertions that tree-based indexes will fail to prune in high-d
- Circumstantial evidence
- Relied on well-known comparative costs for disk and CPU (too generous)
- Not a welcome report!
18. Theorem of Instability
- Reported by Beyer et al. (1999), formalised and extended by Shaft & Ramakrishnan (2005)
- For many data distributions, all pairs of points tend to be (nearly) the same distance apart as the dimensionality rises (illustrated in the sketch below).
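A small, illustrative Python experiment of the effect for i.i.d. Gaussian data (the sample sizes and dimensionalities are arbitrary choices): the relative contrast between the farthest and nearest point from a query shrinks as d grows.

```python
import numpy as np

# As d grows, the nearest and farthest points from a query become almost
# equidistant, so a contracting search region stops pruning anything.
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.standard_normal((10_000, d))
    query = rng.standard_normal(d)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"{d:5}d: (Dmax - Dmin) / Dmin = {contrast:.3f}")
```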
19. Contrast plot, 3 Gaussian sets
20. Which means ...
- Any search method based on a contracting search region must fall to the performance of a naive (sequential) method, sooner or later
- This covers (arguably) all approaches devised to date
- So we need to think boldly (or change our interests) ...
21. Target Problems
- In high-d, operations are most commonly framed in terms of neighbourhoods:
  - k Nearest Neighbours (kNN) query
  - kNN join
  - RkNN query.
- In low-d, operations are most commonly framed in terms of ranges for attributes.
22. kNN Query
- For this query point q, retrieve the 10 objects most similar to it (a naive version is sketched after this list).
- Which requires that we define similarity, conventionally by a distance function
- The query type in high-d
- Almost ubiquitous in high-d
- Formidable literatures.
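For reference, a naive (sequential-scan) kNN query in Python, using Euclidean distance as the similarity function; the names and data are illustrative. This is the baseline that any index-based method has to beat.

```python
import numpy as np

def knn_query(data, q, k=10):
    """Naive kNN: compute the distance from q to every object and keep
    the k smallest. Cost is linear in the size of the data set."""
    dists = np.linalg.norm(data - q, axis=1)
    nearest = np.argpartition(dists, k)[:k]       # k best, in no order
    return nearest[np.argsort(dists[nearest])]    # ordered by distance

rng = np.random.default_rng(1)
data = rng.random((100_000, 16))                  # 100k objects in 16-d
print(knn_query(data, rng.random(16), k=10))      # indices of the 10 NNs
```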
23. kNN Join
- For each object of a set, determine the k most similar points from the set
- Encountered in data mining, classification, compression, etc.
- A little care provides a big reward
- Not a lot of investigation.
24. RkNN Query
- If a new object q appears, for what objects will it be a k Nearest Neighbour?
- E.g. a chain of bookstores knows where its stores are and where its frequent buyers live. It is about to open a new store in Stockbridge. For which frequent buyers will the new store be closer than the current stores?
- Even less investigation. High costs inhibit use (a naive sketch below shows why).
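A naive RkNN sketch in Python (illustrative names and data): for each existing object, q is one of its k nearest neighbours exactly when fewer than k existing objects lie closer to it than q does. The quadratic cost of this scan is exactly why high costs have inhibited use.

```python
import numpy as np

def rknn_query(data, q, k=3):
    """Naive reverse-kNN: return the indices of all objects for which the
    new object q would be among their k nearest neighbours."""
    result = []
    for i, o in enumerate(data):
        d_to_q = np.linalg.norm(o - q)
        d_to_others = np.linalg.norm(data - o, axis=1)
        d_to_others[i] = np.inf                   # ignore self-distance
        if np.sum(d_to_others < d_to_q) < k:      # fewer than k closer objects
            result.append(i)
    return result

rng = np.random.default_rng(2)
buyers = rng.random((2_000, 2))                   # e.g. frequent-buyer homes
new_store = np.array([0.5, 0.5])
print(len(rknn_query(buyers, new_store, k=3)))    # buyers gained by the store
```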
25. Optimised Partitioning: the bet
- If we have a simple index structure and a simple search method, we can frame partitioning of the data set as an optimisation (assignment) problem
- Although it's NP-hard, we can probably solve it, well enough, using an iterative method
- And it might be faster.
26. Which requires ...
- We devise the access method
- Formal statement of the problem
- Objective function
- Constraints.
- Solution Technique
- Evaluate.
27. Partitioning as the core concept
- Reduce the data set to subsets (partitions).
- Partitions contain a variable number of points, with an upper limit.
- Partitions have a Minimum Bounding Box (MBB).
28. Index
- The index is a list of the partitions' MBBs
- In no particular order
- Held in RAM (and so we should impose an upper limit on the number of partitions).
- Each index entry I: (id, low[1..d], high[1..d])
29. Mindist Search Discipline
- Fetch and scan the partitions (in a certain sequence), maintaining a list of the k candidates
- To scan a partition:
  - Evaluate the distance from each member to the query point
  - If better than the current kth candidate, place it in the list of candidates.
30. The Sequence: mindist
- We can simply evaluate the minimum distance from a query point to any point within an MBB (the mindist for a partition)
- If we fetch in ascending mindist, we can stop when a mindist is greater than the distance to the current kth candidate (sketched in code below)
- Conveniently, this is the optimum in terms of partitions fetched.
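A minimal Python sketch of the whole discipline: an index of MBBs, an ascending-mindist fetch order, and pruning against the current kth candidate. The partition construction by simple chunking and all names here are illustrative assumptions, not the talk's actual implementation.

```python
import numpy as np

def mindist(q, low, high):
    """Minimum Euclidean distance from query q to the MBB [low, high]."""
    gap = np.maximum(low - q, 0.0) + np.maximum(q - high, 0.0)
    return np.linalg.norm(gap)

def knn_search(index, q, k=10):
    """index: list of (low, high, points) per partition, in no particular
    order. Fetch in ascending mindist; stop as soon as the next partition's
    mindist exceeds the distance to the current kth candidate."""
    order = sorted(index, key=lambda p: mindist(q, p[0], p[1]))
    best = np.full(k, np.inf)                     # distances of k candidates
    for low, high, points in order:
        if mindist(q, low, high) > best[-1]:
            break                                 # no unseen point can do better
        dists = np.linalg.norm(points - q, axis=1)
        best = np.sort(np.concatenate([best, dists]))[:k]
    return best

# Toy usage: 10,000 points in 8-d, chunked into 100 partitions
rng = np.random.default_rng(3)
pts = rng.random((10_000, 8))
index = [(c.min(axis=0), c.max(axis=0), c) for c in np.array_split(pts, 100)]
print(knn_search(index, rng.random(8), k=10))
```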
31. For example
[Figure: a worked example of the mindist sequence for query point Q over partitions A, B and C, fetching partitions in ascending mindist until the remaining mindists exceed the kth candidate's distance ("Done!").]
32. Objective Function
- Minimise the total elapsed time of performing a large set of queries
- Which requires that we have a representative set of queries, from an historical record or a generator, and that we have the solutions for those queries.
33. The Formal Statement
Where A(B) is the cost of fetching a partition of B points, and C(B) is the cost of scanning a partition of B points.
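The formula itself did not survive transcription. A plausible reconstruction, consistent with the definitions above and the mindist search, is offered purely as an assumption (Q, F(q) and B_p are symbols introduced for the sketch): minimise, over the assignment of points to partitions,

\[
\min \; \sum_{q \in Q} \sum_{p \in F(q)} \bigl( A(B_p) + C(B_p) \bigr)
\]

where Q is the representative query set, F(q) is the set of partitions the mindist search fetches for query q, and B_p is the number of points assigned to partition p.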
34. Unit costs acquired empirically
We can plug in costs for different environments.
35. Constraints
- All points allocated to one (and only one) partition
- Upper limit on the points in a partition
- Upper limit on the number of partitions used.
36. Constraints
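The formal constraints were not transcribed; a standard assignment-problem rendering of the three bullets on the previous slide, offered as an illustrative assumption (x_{ip}, y_p, B_max and P_max are symbols introduced for the sketch), would be:

\[
\sum_{p} x_{ip} = 1 \;\; \forall i, \qquad
\sum_{i} x_{ip} \le B_{\max}\, y_p \;\; \forall p, \qquad
\sum_{p} y_p \le P_{\max}, \qquad
x_{ip},\, y_p \in \{0,1\}
\]

where x_{ip} = 1 if point i is assigned to partition p, and y_p = 1 if partition p is used at all.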
37. Finally
Which leaves us with the assignments of points to partitions as the only decision variables.
38. The Solution Technique
- Applies a conventional iterative refinement to an Initial Feasible Solution (a generic sketch follows this list)
- The problem seems to be fairly placid
- Acceptable load times for the data sets trialled to date.
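The refinement itself is not detailed in the slides; a generic local-search sketch of "iterative refinement from an initial feasible solution" might look like the following (Python; workload_cost, random_move and the move representation are placeholders, not the talk's method):

```python
def refine(assignment, workload_cost, random_move, max_iters=10_000):
    """Generic iterative refinement: start from an initial feasible
    assignment of points to partitions and keep any random move (e.g.
    shifting one point to another non-full partition) that lowers the
    estimated cost of the query workload."""
    best_cost = workload_cost(assignment)
    for _ in range(max_iters):
        candidate = random_move(assignment)
        cost = workload_cost(candidate)
        if cost < best_cost:
            assignment, best_cost = candidate, cost
    return assignment, best_cost
```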
39. How to assess?
- Not hard to generate meaningless performance data
- Basic behaviour: synthetic data (N, d, k, distribution)
- Comparative: real data sets
- Benchmarks: the naive method and the best previously reported
- Careful implementation of a naive method can be rewarding.
40. Response with number of points (N)
41. Response with Dimensionality
44. What does it mean?
- Can reduce times by a factor of 3, below the cutoff
- The cutoff depends on the dataset size
- Some conjectures drawn from the Theorems are based on an unrealistic model and are probably quantitatively wrong
- Times for kNN queries have apparently fallen from 50 ms to 0.5 ms; 48.5 ms of that is attributable to system caching.
45. Join? RkNN?
- Work in progress!
- Specialist kNN join algorithms are well worthwhile
- Optimised Partitioning for RkNN works well
  - Falls in query costs from 5 sec (or so) to 5 ms (or so)
- Query + join + reverse is a nice package.
46. Which all suggests (Part 1)
- Neighbourhood operations are used only in a few, specialised geospatial apps
- Specific data structures used
- A more general view of neighbourhood might open up more apps
- E.g. finding clusters of galaxies from catalogues
  - Large groups of galaxies that are gravitationally bound
  - Available definitions are not helpful in seeing clusters; the core element is high density
  - Search by neighbourhoods, rather than an arbitrary grid, to find high-density regions.
47. Which all suggests (Part 2)
- Algorithms using kNN as a basic operation can be accelerated by (apparently) x100
- RkNN is apparently much cheaper than we expected (and ...)
- Designer data structures appear possible (e.g. design such that no more than 5% of transactions take more than 50 ms).
48. And which shows ...
- There are many interesting, open problems out there for data engineers
- Using Other People's Techniques can be quite profitable
- Data Engineers can be useful eScience team members.
49. More?