Living under the Curse of Dimensionality - PowerPoint PPT Presentation

1
Living under the Curse of Dimensionality
  • Dave Abel
  • CSIRO

2
Roadmap
  • Why spend time on high-dimensional data?
  • Non-trivial
  • Some explanations
  • A different approach
  • Which leads to …

3
The Subtext: data engineering
  • Solution techniques are compositions of
    algorithms for fundamental operations
  • Algorithms assume certain contexts
  • There can be gains of orders of magnitude in
    using good algorithms that are suited to the
    context
  • Sometimes better algorithms need to be built for
    new contexts.

4
COTS Database technology
  • Simple data is handled well, even for very
    large databases and high transaction volumes, by
    relational database
  • Geospatial data (2d, 2.5d, 3d) is handled
    reasonably well
  • But pictures, series, sequences, … are poorly
    supported.

5
For example
  • Find the 10 days for which trading on the LSX was
    most similar to today's, and the pattern for the
    following day
  • Find the 20 sequences from SwissProt that are
    most similar to this one
  • If I hum the first few bars, can you fetch the
    song from the music archive?

6
Dimensionality?
  • It's all in the modelling
  • "k-d" means the important relationships and
    operations on these objects involve a certain set
    of k attributes as a bloc
  • 1d: a list; key properties flow from the value of
    a single attribute (its position in the list)
  • 2d: points on a plane; key properties and
    relationships flow from position on the plane
  • 3d and 4d …

7
All in the modelling
  • Take a set of galaxies
  • Some physical interactions deal with galaxies as
    points in 3d (spatial) space
  • Or analyses based on the colours of galaxies
    could consider them as points in (say) 5d
    (colour) space

8
All in the modelling (>5d)
  • Complex data types (pictures, graphs, etc.) can be
    modelled as k-d points using well-known tricks
  • A blinking star could be modelled by the
    histogram of its brightness
  • A photo could be represented as a histogram of
    brightness × colour (3×3) of its pixels (i.e. as
    a point in 9d space)
  • A sonar echo could be modelled by the intensity
    every 10 ms after the first return.
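The slide's 3×3 trick can be made concrete; a minimal sketch assuming NumPy, with illustrative binning rules (the brightness terciles and dominant-channel "colour" proxy are assumptions, not the talk's):

```python
import numpy as np

def photo_as_9d_point(pixels):
    """Represent an RGB image as a point in 9d space: a 3x3
    histogram of brightness x colour, as sketched on the slide.
    Bin boundaries here are illustrative assumptions."""
    rgb = np.asarray(pixels, dtype=float) / 255.0   # H x W x 3
    brightness = rgb.mean(axis=2)                   # in [0, 1]
    colour = rgb.argmax(axis=2)                     # dominant channel: 0, 1 or 2
    b_bin = np.minimum((brightness * 3).astype(int), 2)  # 3 brightness bins
    hist = np.zeros((3, 3))
    for b, c in zip(b_bin.ravel(), colour.ravel()):
        hist[b, c] += 1
    return (hist / hist.sum()).ravel()              # normalised 9d point

img = np.random.randint(0, 256, size=(32, 32, 3))
point = photo_as_9d_point(img)
print(point.shape)  # (9,)
```

Any pair of photos can then be compared by an ordinary distance between their 9d points.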

9
Access Methods
  • Access methods structure a data set for
    efficient search
  • The standard components of a method are
  • Reduction of the data set to a set of sub-sets
    (partitions)
  • Definition of a directory (index) of partitions
    to allow traversal
  • Definition of a search algorithm that traverses
    intelligently.

10
Only a few variants on the theme
  • Space-based
  • Cells derived by a regular decomposition of the
    data space, s.t. cells have nice properties
  • Points assigned to cells
  • Data-based
  • Decomposition of the data set into sub-sets, s.t.
    the sub-sets have nice properties
  • Incremental or bulk load.
  • Efficiency comes through pruning: the index
    supports discovery of the partitions that need
    not be accessed.

11
k-d: an extension of 2d?
  • Extensive R&D on (geo)spatial database, 1985-1995
  • Surely kd is just a generalisation of the
    problems in 2d and 3d?
  • Analogues of 2d methods ran out of puff at about
    8d, sometimes earlier
  • Why was this? Did it matter?

12
The Curse of Dimensionality
  • Named by Bellman (1961)
  • Creep in applicability, to generally include the
    counter-intuitive effects that become
    increasingly awkward as the dimensionality rises
  • And the non-linearity of costs with
    dimensionality (often exponential)
  • Two examples.

13
CofD Example 1
  • Sample the space [0,1]^d by a grid with a spacing
    of 0.1
  • 1d: 10 points
  • 2d: 100 points
  • 3d: 1,000 points
  • 10d: 10,000,000,000 points
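The counts follow from the slide's 10 samples per axis raised to the dth power; a quick sketch of the exponential blow-up:

```python
# Points needed to sample [0,1]^d on a grid with spacing 0.1,
# taking the slide's 10 samples per axis: the count is 10^d,
# exponential in the dimensionality.
def grid_points(d, spacing=0.1):
    per_axis = round(1 / spacing)   # 10 samples per axis
    return per_axis ** d

for d in (1, 2, 3, 10):
    print(d, grid_points(d))
```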

14
CofD Example 2
  • Determine the mean number of points within a
    hypersphere of radius r, placed randomly within
    the unit hypercube with a density of a. Let's
    assume r << 1.
  • Trivial if we ignore edge effects
  • But that would be misleading

15
Edge effects?
P(edge effect) = 2r (1d)
               = 4r - 4r^2 (2d)
               = 6r - 12r^2 + 8r^3 (3d)
(in general, 1 - (1 - 2r)^d)
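The expansions above are the low-order terms of 1 − (1 − 2r)^d, the probability that a sphere of radius r, with centre uniform in the unit hypercube, crosses at least one face; a quick numeric check:

```python
def p_edge(r, d):
    # Probability that a sphere of radius r, centred uniformly in
    # the unit hypercube, crosses at least one face: 1 - (1 - 2r)^d.
    return 1.0 - (1.0 - 2.0 * r) ** d

r = 0.05
print(round(p_edge(r, 1), 6))    # = 2r: 0.1
print(round(p_edge(r, 3), 6))    # = 6r - 12r^2 + 8r^3: 0.271
print(round(p_edge(r, 100), 6))  # ~1: nearly every point is near a face
```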
16
Which means
  • If it's a uniform random distribution, a point is
    likely to be near a face (or edge) in
    high-dimensional space
  • Analyses quickly end up in intractable
    expressions
  • Usually, interesting behaviour is lost when
    models are simplified to permit neat analyses.

17
Early rumbles
  • Weber et al. (1998): assertions that tree-based
    indexes will fail to prune in high-d
  • Circumstantial evidence
  • Relied on well-known comparative costs for disk
    and CPU (too generous)
  • Not a welcome report!

18
Theorem of Instability
  • Reported by Beyer et al. (1999); formalised and
    extended by Shaft & Ramakrishnan (2005)
  • For many data distributions, as dimensionality
    rises, all pairs of points become (almost) the
    same distance apart.

19
Contrast plot, 3 Gaussian sets
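The collapse of contrast can be reproduced on synthetic Gaussian data; a minimal sketch assuming NumPy (three dimensionalities stand in for the slide's three sets):

```python
import numpy as np

rng = np.random.default_rng(0)

def contrast(d, n=2000):
    # Contrast = (dmax - dmin) / dmin over distances from a query
    # point to a Gaussian cloud; it collapses as d rises, which is
    # the instability the theorem formalises.
    pts = rng.normal(size=(n, d))
    q = rng.normal(size=d)
    dist = np.linalg.norm(pts - q, axis=1)
    return (dist.max() - dist.min()) / dist.min()

for d in (2, 20, 200):
    print(d, round(contrast(d), 2))
```

In low d the nearest point is dramatically closer than the farthest; in high d the ratio approaches zero and "nearest" loses discriminating power.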
20
Which means
  • Any search method based on a contracting search
    region must fall to the performance of a naive
    (sequential) method, sooner or later
  • This covers (arguably) all approaches devised to
    date
  • So we need to think boldly (or change our
    interests) ...

21
Target Problems
  • In high-d, operations most commonly are framed in
    terms of neighbourhoods
  • K Nearest Neighbours (kNN) query
  • kNN join
  • RkNN query.
  • In low-d, operations are most commonly framed in
    terms of ranges for attributes.

22
kNN Query
  • For this query point q, retrieve the 10 objects
    most similar to it.
  • Which requires that we define similarity,
    conventionally by a distance function
  • The query type in high-d
  • Almost ubiquitous in high-d
  • A formidable literature.

23
kNN Join
  • For each object of a set, determine the k most
    similar points from the set
  • Encountered in data mining, classification,
    compression, …
  • A little care provides a big reward
  • Not a lot of investigation.

24
RkNN Query
  • If a new object q appears, for what objects will
    it be a k Nearest Neighbour?
  • E.g. a chain of bookstores knows where its stores
    are and where its frequent-buyers live. It is
    about to open a new store in Stockbridge. For
    which frequent-buyers will the new store be
    closer than the current stores?
  • Even less investigation. High costs inhibit use.

25
Optimised Partitioning: the bet
  • If we have a simple index structure and a simple
    search method, we can frame partitioning of the
    data set as an optimisation (assignment) problem
  • Although it's NP-hard, we can probably solve it,
    well enough, using an iterative method
  • And it might be faster.

26
Which requires
  • We devise the access method
  • Formal statement of the problem
  • Objective function
  • Constraints.
  • Solution Technique
  • Evaluate.

27
Partitioning as the core concept
  • Reduce the data set to subsets (partitions).
  • Partitions contain a variable number of points,
    with an upper limit.
  • Partitions have a Minimum Bounding Box.

28
Index
  • The index is a list of the partitions' MBBs
  • In no particular order
  • Held in RAM (and so we should impose an upper
    limit on the number of partitions).
  • Each entry: I = (id, low[1..d], high[1..d])

29
Mindist Search Discipline
  • Fetch and scan the partitions (in a certain
    sequence), maintaining a list of the k
    candidates
  • To scan a partition,
  • Evaluate the distance from each member to the
    query point
  • If better than the current kth candidate, place
    it in the list of candidates.

30
The Sequence mindist
  • We can cheaply evaluate the minimum distance from
    a query point to any point within an MBB (the
    mindist for a partition)
  • If we fetch in ascending mindist, we can stop
    when a mindist is greater than the distance to
    the current kth candidate
  • Conveniently, this is the optimum in terms of
    partitions fetched.
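The discipline above can be sketched in a few lines; a minimal sketch assuming NumPy (the function names and partition layout are illustrative, not the talk's implementation):

```python
import heapq
import numpy as np

def mindist(q, low, high):
    # Minimum Euclidean distance from query point q to the
    # MBB [low, high]; zero when q lies inside the box.
    gap = np.maximum(low - q, 0) + np.maximum(q - high, 0)
    return float(np.linalg.norm(gap))

def knn_scan(q, partitions, k):
    """Mindist search over partitions given as (low, high, points).
    Partitions are fetched in ascending mindist; the scan stops once
    the next mindist exceeds the distance to the kth candidate."""
    cand = []  # max-heap of negated distances: -cand[0] is the kth best
    for low, high, pts in sorted(
            partitions, key=lambda p: mindist(q, p[0], p[1])):
        if len(cand) == k and mindist(q, low, high) > -cand[0]:
            break                      # optimal stopping point
        for p in pts:                  # scan the partition
            d = float(np.linalg.norm(p - q))
            if len(cand) < k:
                heapq.heappush(cand, -d)
            elif d < -cand[0]:
                heapq.heapreplace(cand, -d)
    return sorted(-x for x in cand)    # the k smallest distances
```

The stopping rule is what makes the fetch sequence optimal: any partition skipped has a mindist larger than the current kth candidate distance, so it cannot contain a better neighbour.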

31
For example
[Figure: a worked 2d example. Partitions A, B and C around query
point Q; partitions are fetched in ascending mindist order until
the stopping condition is met ("Done!").]
32
Objective Function
  • Minimise the total elapsed time of performing a
    large set of queries
  • Which requires that we have a representative set
    of queries, from an historical record or a
    generator. And we have the solutions for those
    queries.

33
The Formal Statement
Where A(B) is the cost of fetching a partition of
B points, and C(B) is the cost of scanning a
partition of B points.
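The equation itself was not transcribed. One plausible formalisation consistent with the caption (the query set Q and fetched-set notation F(q) are assumed symbols, not necessarily the slide's own):

```latex
\min \sum_{q \in Q} \; \sum_{p \in F(q)} \bigl( A(B_p) + C(B_p) \bigr)
```

where Q is the representative query set, F(q) the set of partitions fetched for query q, and B_p the number of points in partition p.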
34
Unit costs acquired empirically
We can plug in costs for different environments.
35
Constraints
  • All points allocated to one (and only one)
    partition
  • Upper limit on points in a partition
  • Upper limit on number of partitions used.

36
Constraints
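The constraint equations were not transcribed; a hedged reconstruction of the three constraints on the previous slide, with assumed symbols (x_ip assigns point i to partition p, B is the partition capacity, P the partition budget):

```latex
\sum_{p} x_{ip} = 1 \quad \forall i
\qquad
\sum_{i} x_{ip} \le B \quad \forall p
\qquad
\bigl|\{\, p : \textstyle\sum_i x_{ip} > 0 \,\}\bigr| \le P
\qquad
x_{ip} \in \{0, 1\}
```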
37
Finally
Which leaves us with the assignments of points to
partitions as the only decision variables.
38
The Solution Technique
  • Applies a conventional iterative refinement to an
    Initial Feasible Solution
  • The problem seems to be fairly placid
  • Acceptable load times for data sets trialled to
    date.

39
How to assess?
  • Not hard to generate meaningless performance
    data
  • Basic behaviour: synthetic data (N, d, k,
    distribution)
  • Comparative: real data sets
  • Benchmarks: naive method and
    best-previously-reported
  • Careful implementation of a naive method can be
    rewarding.

40
Response with N of points
41
Response with Dimensionality
44
What does it mean?
  • Can reduce times by a factor of 3, below the
    cutoff
  • The cutoff depends on the dataset size
  • Some conjectures drawn from the Theorems are
    based on an unrealistic model and are probably
    quantitatively wrong
  • Times for kNN queries have apparently fallen from
    50 ms to 0.5 ms; 48.5 ms of that is attributable
    to system caching.

45
Join? RkNN?
  • Work in progress!
  • Specialist kNN Join algorithms are well
    worthwhile
  • Optimised Partitioning for RkNN works well
  • Query costs fall from 5 s (or so) to 5 ms
    (or so)
  • Query + join + reverse is a nice package.

46
Which all suggests (Part 1)
  • Neighbourhood operations used only in a few,
    specialised geospatial apps
  • Specific data structures used
  • More general view of neighbourhood might open
    up more apps
  • E.g. finding clusters of galaxies from catalogues
  • Large groups of galaxies that are bound
    gravitationally
  • Available definitions are not helpful in seeing
    clusters. The core element is high density
  • Search by neighbourhoods, rather than an
    arbitrary grid, to find high-density regions.

47
Which all suggests (Part 2)
  • Algorithms using kNN as a basic operation can be
    accelerated by (apparently) ×100
  • RkNN is apparently much cheaper than we expected
    (and …)
  • Designer data structures appear possible (e.g.
    design such that no more than 5% of transactions
    take more than 50 ms).

48
And which shows
  • There are many interesting, open problems out
    there, for data engineers
  • Using Other People's Techniques can be quite
    profitable
  • Data Engineers can be useful eScience team
    members.

49
More?
  • dave.abel@csiro.au