Title: Living under the Curse of Dimensionality
1. Living under the Curse of Dimensionality
2. Roadmap
- Why spend time on high-dimensional data?
- Non-trivial
- Some explanations
- A different approach
- Which leads to ...
3. The Subtext: data engineering
- Solution techniques are compositions of algorithms for fundamental operations
- Algorithms assume certain contexts
- There can be gains of orders of magnitude in using good algorithms that are suited to the context
- Sometimes better algorithms need to be built for new contexts.
4. COTS Database technology
- Simple data is handled well, even for very large databases and high transaction volumes, by relational databases
- Geospatial data (2d, 2.5d, 3d) is handled reasonably well
- But pictures, series, sequences, etc. are poorly supported.
5. For example
- Find the 10 days for which trading on the LSX was most similar to today's, and the pattern for the following day
- Find the 20 sequences from SwissProt that are most similar to this one
- If I hum the first few bars, can you fetch the song from the music archive?
6. Dimensionality?
- It's all in the modelling
- k-d means that the important relationships and operations on these objects involve a certain set of k attributes as a bloc
- 1d: a list; key properties flow from the value of a single attribute (position in the list)
- 2d: points on a plane; key properties and relationships follow from position on the plane
- 3d and 4d ...
7. All in the modelling
- Take a set of galaxies
- Some physical interactions deal with galaxies as points in 3d (spatial) space
- Or analyses based on the colours of galaxies could consider them as points in (say) 5d (colour) space
8. All in the modelling (>5d)
- Complex data types (pictures, graphs, etc.) can be modelled as k-d points using well-known tricks
- A blinking star could be modelled by the histogram of its brightness
- A photo could be represented as a histogram of brightness x colour (3x3) of its pixels, i.e. as a point in 9d space (a minimal sketch follows this list)
- A sonar echo could be modelled by the intensity every 10 ms after the first return.
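To make the "9d point" trick concrete, here is a minimal Python sketch; the function name, the use of a single colour channel, and the toy random image are illustrative assumptions rather than the talk's own pipeline.

```python
import numpy as np

def photo_to_9d_point(brightness, colour, bins=3):
    """Model a photo as a point in 9d space: a joint 3x3 histogram of
    per-pixel brightness and colour, flattened and normalised.
    Inputs are per-pixel values scaled to [0, 1]."""
    hist, _, _ = np.histogram2d(brightness, colour,
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist.ravel() / hist.sum()   # 3x3 grid -> normalised 9d vector

# Toy usage with a random 100x100 "photo"
rng = np.random.default_rng(0)
print(photo_to_9d_point(rng.random(100 * 100), rng.random(100 * 100)))
```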
9. Access Methods
- Access methods structure a data set for efficient search
- The standard components of a method are:
  - Reduction of the data set to a set of sub-sets (partitions)
  - Definition of a directory (index) of partitions to allow traversal
  - Definition of a search algorithm that traverses intelligently.
10. Only a few variants on the theme
- Space-based
  - Cells derived by a regular decomposition of the data space, s.t. cells have nice properties
  - Points assigned to cells
- Data-based
  - Decomposition of the data set into sub-sets, s.t. the sub-sets have nice properties
  - Incremental or bulk load.
- Efficiency comes through pruning: the index supports discovery of the partitions that need not be accessed.
11. k-d: an extension of 2d?
- Extensive R&D on (geo)spatial databases, 1985-1995
- Surely k-d is just a generalisation of the problems in 2d and 3d?
- Analogues of 2d methods ran out of puff at about 8d, sometimes earlier
- Why was this? Did it matter?
12. The Curse of Dimensionality
- Named by Bellman (1961)
- Creep in applicability, so that it now generally includes the non-commonsense effects that become increasingly awkward as the dimensionality rises
- And the non-linearity of costs with dimensionality (often exponential)
- Two examples.
13. CofD Example 1
- Sample the space [0,1]^d by a grid with a spacing of 0.1 (see the sketch after this list)
  - 1d: 10 points
  - 2d: 100 points
  - 3d: 1,000 points
  - ...
  - 10d: 10,000,000,000 points
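The growth is simply (1/spacing)^d; a two-line Python sketch (purely illustrative) makes the explosion concrete:

```python
# Grid points needed to sample [0,1]^d at spacing 0.1: (1 / 0.1)^d = 10^d
for d in (1, 2, 3, 10, 20):
    print(f"{d:2}d: {10 ** d:,} points")
```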
14. CofD Example 2
- Determine the mean number of points within a hypersphere of radius r, placed randomly within the unit hypercube with a point density of a. Let's assume r << 1.
- Trivial if we ignore edge effects
- But that would be misleading
15. Edge effects?
P(edge effect) = 1 - (1 - 2r)^d
             = 2r                  (1d)
             = 4r - 4r^2           (2d)
             = 6r - 12r^2 + 8r^3   (3d)
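A quick numerical check of the expansion above (Python, illustrative): even for small r, the probability of an edge effect climbs rapidly towards 1 as d rises.

```python
# Probability that a point drawn uniformly from the unit hypercube lies
# within r of at least one face, i.e. 1 - (1 - 2r)^d.
def p_edge_effect(r, d):
    return 1.0 - (1.0 - 2.0 * r) ** d

for d in (1, 2, 3, 10, 100):
    print(f"{d:3}d: P(edge effect) = {p_edge_effect(0.1, d):.3f}")
```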
16. Which means ...
- If it's a uniform random distribution, a point is likely to be near a face (or edge) in high-dimensional space
- Analyses quickly end up in intractable expressions
- Usually, interesting behaviour is lost when models are simplified to permit neat analyses.
17. Early rumbles
- Weber et al. (1998): assertions that tree-based indexes will fail to prune in high-d
- Circumstantial evidence
- Relied on well-known comparative costs for disk and CPU (too generous)
- Not a welcome report!
18. Theorem of Instability
- Reported by Beyer et al. (1999), formalised and extended by Shaft & Ramakrishnan (2005)
- For many data distributions, all pairs of points tend to be (nearly) the same distance apart as the dimensionality rises (illustrated in the sketch below).
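A small, illustrative Python experiment of the effect for i.i.d. Gaussian data (the sample sizes and dimensionalities are arbitrary choices): the relative contrast between the farthest and nearest point from a query shrinks as d grows.

```python
import numpy as np

# As d grows, the nearest and farthest points from a query become almost
# equidistant, so a contracting search region stops pruning anything.
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.standard_normal((10_000, d))
    query = rng.standard_normal(d)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"{d:5}d: (Dmax - Dmin) / Dmin = {contrast:.3f}")
```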
19. Contrast plot, 3 Gaussian sets
20. Which means ...
- Any search method based on a contracting search region must fall to the performance of a naive (sequential) method, sooner or later
- This covers (arguably) all approaches devised to date
- So we need to think boldly (or change our interests) ...
21. Target Problems
- In high-d, operations are most commonly framed in terms of neighbourhoods:
  - k Nearest Neighbours (kNN) query
  - kNN join
  - RkNN query.
- In low-d, operations are most commonly framed in terms of ranges for attributes.
22. kNN Query
- For this query point q, retrieve the 10 objects most similar to it (a naive version is sketched after this list).
- Which requires that we define similarity, conventionally by a distance function
- The query type in high-d
- Almost ubiquitous in high-d
- Formidable literatures.
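For reference, a naive (sequential-scan) kNN query in Python, using Euclidean distance as the similarity function; the names and data are illustrative. This is the baseline that any index-based method has to beat.

```python
import numpy as np

def knn_query(data, q, k=10):
    """Naive kNN: compute the distance from q to every object and keep
    the k smallest. Cost is linear in the size of the data set."""
    dists = np.linalg.norm(data - q, axis=1)
    nearest = np.argpartition(dists, k)[:k]       # k best, in no order
    return nearest[np.argsort(dists[nearest])]    # ordered by distance

rng = np.random.default_rng(1)
data = rng.random((100_000, 16))                  # 100k objects in 16-d
print(knn_query(data, rng.random(16), k=10))      # indices of the 10 NNs
```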
23. kNN Join
- For each object of a set, determine the k most similar points from the set
- Encountered in data mining, classification, compression, etc.
- A little care provides a big reward
- Not a lot of investigation.
24. RkNN Query
- If a new object q appears, for what objects will it be a k Nearest Neighbour?
- E.g. a chain of bookstores knows where its stores are and where its frequent buyers live. It is about to open a new store in Stockbridge. For which frequent buyers will the new store be closer than the current stores?
- Even less investigation. High costs inhibit use (a naive sketch below shows why).
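A naive RkNN sketch in Python (illustrative names and data): for each existing object, q is one of its k nearest neighbours exactly when fewer than k existing objects lie closer to it than q does. The quadratic cost of this scan is exactly why high costs have inhibited use.

```python
import numpy as np

def rknn_query(data, q, k=3):
    """Naive reverse-kNN: return the indices of all objects for which the
    new object q would be among their k nearest neighbours."""
    result = []
    for i, o in enumerate(data):
        d_to_q = np.linalg.norm(o - q)
        d_to_others = np.linalg.norm(data - o, axis=1)
        d_to_others[i] = np.inf                   # ignore self-distance
        if np.sum(d_to_others < d_to_q) < k:      # fewer than k closer objects
            result.append(i)
    return result

rng = np.random.default_rng(2)
buyers = rng.random((2_000, 2))                   # e.g. frequent-buyer homes
new_store = np.array([0.5, 0.5])
print(len(rknn_query(buyers, new_store, k=3)))    # buyers gained by the store
```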
25. Optimised Partitioning: the bet
- If we have a simple index structure and a simple search method, we can frame partitioning of the data set as an optimisation (assignment) problem
- Although it's NP-hard, we can probably solve it, well enough, using an iterative method
- And it might be faster.
26. Which requires ...
- We devise the access method
- Formal statement of the problem
- Objective function
- Constraints.
- Solution Technique
- Evaluate.
27. Partitioning as the core concept
- Reduce the data set to subsets (partitions).
- Partitions contain a variable number of points, with an upper limit.
- Partitions have a Minimum Bounding Box (MBB).
28. Index
- The index is a list of the partitions' MBBs
- In no particular order
- Held in RAM (and so we should impose an upper limit on the number of partitions).
- Each index entry I: (id, low[1..d], high[1..d])
29. Mindist Search Discipline
- Fetch and scan the partitions (in a certain sequence), maintaining a list of the k candidates
- To scan a partition:
  - Evaluate the distance from each member to the query point
  - If better than the current kth candidate, place it in the list of candidates.
30. The Sequence: mindist
- We can simply evaluate the minimum distance from a query point to any point within an MBB (the mindist for a partition)
- If we fetch in ascending mindist, we can stop when a mindist is greater than the distance to the current kth candidate (sketched in code below)
- Conveniently, this is the optimum in terms of partitions fetched.
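A minimal Python sketch of the whole discipline: an index of MBBs, an ascending-mindist fetch order, and pruning against the current kth candidate. The partition construction by simple chunking and all names here are illustrative assumptions, not the talk's actual implementation.

```python
import numpy as np

def mindist(q, low, high):
    """Minimum Euclidean distance from query q to the MBB [low, high]."""
    gap = np.maximum(low - q, 0.0) + np.maximum(q - high, 0.0)
    return np.linalg.norm(gap)

def knn_search(index, q, k=10):
    """index: list of (low, high, points) per partition, in no particular
    order. Fetch in ascending mindist; stop as soon as the next partition's
    mindist exceeds the distance to the current kth candidate."""
    order = sorted(index, key=lambda p: mindist(q, p[0], p[1]))
    best = np.full(k, np.inf)                     # distances of k candidates
    for low, high, points in order:
        if mindist(q, low, high) > best[-1]:
            break                                 # no unseen point can do better
        dists = np.linalg.norm(points - q, axis=1)
        best = np.sort(np.concatenate([best, dists]))[:k]
    return best

# Toy usage: 10,000 points in 8-d, chunked into 100 partitions
rng = np.random.default_rng(3)
pts = rng.random((10_000, 8))
index = [(c.min(axis=0), c.max(axis=0), c) for c in np.array_split(pts, 100)]
print(knn_search(index, rng.random(8), k=10))
```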
31. For example
[Figure: a worked example of the mindist sequence for query point Q over partitions A, B and C, fetching partitions in ascending mindist until the remaining mindists exceed the kth candidate's distance ("Done!").]
32. Objective Function
- Minimise the total elapsed time of performing a large set of queries
- Which requires that we have a representative set of queries, from an historical record or a generator, and that we have the solutions for those queries.
33. The Formal Statement
Where A(B) is the cost of fetching a partition of B points, and C(B) is the cost of scanning a partition of B points.
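The formula itself did not survive transcription. A plausible reconstruction, consistent with the definitions above and the mindist search, is offered purely as an assumption (Q, F(q) and B_p are symbols introduced for the sketch): minimise, over the assignment of points to partitions,

\[
\min \; \sum_{q \in Q} \sum_{p \in F(q)} \bigl( A(B_p) + C(B_p) \bigr)
\]

where Q is the representative query set, F(q) is the set of partitions the mindist search fetches for query q, and B_p is the number of points assigned to partition p.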
34. Unit costs acquired empirically
We can plug in costs for different environments.
35. Constraints
- All points allocated to one (and only one) partition
- Upper limit on the points in a partition
- Upper limit on the number of partitions used.
36. Constraints
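The formal constraints were not transcribed; a standard assignment-problem rendering of the three bullets on the previous slide, offered as an illustrative assumption (x_{ip}, y_p, B_max and P_max are symbols introduced for the sketch), would be:

\[
\sum_{p} x_{ip} = 1 \;\; \forall i, \qquad
\sum_{i} x_{ip} \le B_{\max}\, y_p \;\; \forall p, \qquad
\sum_{p} y_p \le P_{\max}, \qquad
x_{ip},\, y_p \in \{0,1\}
\]

where x_{ip} = 1 if point i is assigned to partition p, and y_p = 1 if partition p is used at all.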
37. Finally
Which leaves us with the assignments of points to partitions as the only decision variables.
38. The Solution Technique
- Applies a conventional iterative refinement to an Initial Feasible Solution (a generic sketch follows this list)
- The problem seems to be fairly placid
- Acceptable load times for the data sets trialled to date.
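The refinement itself is not detailed in the slides; a generic local-search sketch of "iterative refinement from an initial feasible solution" might look like the following (Python; workload_cost, random_move and the move representation are placeholders, not the talk's method):

```python
def refine(assignment, workload_cost, random_move, max_iters=10_000):
    """Generic iterative refinement: start from an initial feasible
    assignment of points to partitions and keep any random move (e.g.
    shifting one point to another non-full partition) that lowers the
    estimated cost of the query workload."""
    best_cost = workload_cost(assignment)
    for _ in range(max_iters):
        candidate = random_move(assignment)
        cost = workload_cost(candidate)
        if cost < best_cost:
            assignment, best_cost = candidate, cost
    return assignment, best_cost
```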
39. How to assess?
- Not hard to generate meaningless performance data
- Basic behaviour: synthetic data (N, d, k, distribution)
- Comparative: real data sets
- Benchmarks: the naive method and the best previously reported
- Careful implementation of a naive method can be rewarding.
40. Response with number of points (N)
41. Response with Dimensionality
44. What does it mean?
- Can reduce times by a factor of 3, below the cutoff
- The cutoff depends on the dataset size
- Some conjectures drawn from the Theorems are based on an unrealistic model and are probably quantitatively wrong
- Times for kNN queries have apparently fallen from 50 ms to 0.5 ms; 48.5 ms of that is attributable to system caching.
45. Join? RkNN?
- Work in progress!
- Specialist kNN join algorithms are well worthwhile
- Optimised Partitioning for RkNN works well
  - Falls in query costs from 5 sec (or so) to 5 ms (or so)
- Query + join + reverse is a nice package.
46. Which all suggests (Part 1)
- Neighbourhood operations are used only in a few, specialised geospatial apps
- Specific data structures used
- A more general view of neighbourhood might open up more apps
- E.g. finding clusters of galaxies from catalogues
  - Large groups of galaxies that are gravitationally bound
  - Available definitions are not helpful in seeing clusters; the core element is high density
  - Search by neighbourhoods, rather than an arbitrary grid, to find high-density regions.
47. Which all suggests (Part 2)
- Algorithms using kNN as a basic operation can be accelerated by (apparently) x100
- RkNN is apparently much cheaper than we expected (and ...)
- Designer data structures appear possible (e.g. design such that no more than 5% of transactions take more than 50 ms).
48. And which shows ...
- There are many interesting, open problems out there for data engineers
- Using Other People's Techniques can be quite profitable
- Data Engineers can be useful eScience team members.
49. More?