Title: Fast Algorithms and Data Structures for Visualization and Machine Learning on Massive Data Sets
Fast Algorithms and Data Structures for Visualization and Machine Learning on Massive Data Sets
- Alexander Gray
- Fundamental Algorithmic and Statistical Tools Laboratory (FASTlab)
- Computational Science and Engineering Division
- College of Computing
- Georgia Institute of Technology
The FASTlab
- Arkadas Ozakin: research scientist, PhD theoretical physics
- Dong Ryeol Lee: PhD student, CS + Math
- Ryan Riegel: PhD student, CS + Math
- Parikshit Ram: PhD student, CS + Math
- William March: PhD student, Math + CS
- James Waters: PhD student, Physics + CS
- Nadeem Syed: PhD student, CS
- Hua Ouyang: PhD student, CS
- Sooraj Bhat: PhD student, CS
- Ravi Sastry: PhD student, CS
- Long Tran: PhD student, CS
- Michael Holmes: PhD student, CS + Physics (co-supervised)
- Nikolaos Vasiloglou: PhD student, EE (co-supervised)
- Wei Guan: PhD student, CS (co-supervised)
- Nishant Mehta: PhD student, CS (co-supervised)
- Wee Chin Wong: PhD student, ChemE (co-supervised)
- Abhimanyu Aditya: MS student, CS
- Yatin Kanetkar: MS student, CS
Goal
- New displays for high-dimensional data
  - Isometric non-negative matrix factorization
  - Rank-based embedding
  - Density-preserving maps
  - Co-occurrence embedding
- New algorithms for scaling them to big datasets
  - Distances: Generalized Fast Multipole Method
  - Dot products: Cosine Trees and QUIC-SVD
  - MLPACK
Plotting high-D data in 2-D
- Dimension reduction beyond PCA: manifolds, embedding, etc.
Goal
- New displays for high-dimensional data
  - Isometric non-negative matrix factorization
  - Rank-based embedding
  - Density-preserving maps
  - Co-occurrence embedding
- New algorithms for scaling them to big datasets
  - Distances: Generalized Fast Multipole Method
  - Dot products: Cosine Trees and QUIC-SVD
  - MLPACK
Isometric Non-negative Matrix Factorization
- NMF maintains the interpretability of components of data like images, text, or spectra (SDSS)
- However, as a low-D display it is not in general faithful to the original distances
- Isometric NMF (Vasiloglou, Gray, Anderson, to be submitted to SIAM DM 2008) preserves both distances and non-negativity; geometric programming formulation
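The isometric variant is the slide's contribution and is not reproduced here, but the plain NMF it builds on can be sketched in a few lines. Below is a minimal pure-Python sketch of the classic multiplicative-update NMF (Lee and Seung); the function names and the tiny test matrix are illustrative, not from the talk.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, rank, iters=200, eps=1e-9):
    """Classic multiplicative-update NMF: find nonnegative W (n x rank)
    and H (rank x m) with V ~ W H. Updates keep factors nonnegative."""
    rng = random.Random(0)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        WtV = matmul(transpose(W), V)
        WtWH = matmul(transpose(W), matmul(W, H))
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        VHt = matmul(V, transpose(H))
        WHHt = matmul(matmul(W, H), transpose(H))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H

def frob_err(V, W, H):
    """Squared Frobenius norm of the reconstruction residual."""
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))
```

As the slide notes, such a factorization is interpretable (nonnegative parts) but nothing in these updates preserves pairwise distances, which is the gap Isometric NMF addresses.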
Rank-based Embedding
- Suppose you don't really have meaningful or reliable distances, but you can say that A and B are farther apart than A and C, e.g. in document relevance
- It is still possible to make an embedding! In fact, there is some indication that using ranks is more stable than using distances
- Can be formulated using hyperkernels; becomes either an SDP or a QP (Ouyang and Gray, ICML 2008)
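The hyperkernel SDP/QP formulation is beyond a short snippet, but the input it consumes — ordinal triplets "A and B are farther apart than A and C" — can be illustrated with a much simpler gradient-based triplet embedding. This is a toy stand-in, not the ICML 2008 method; all names and parameters are made up for illustration.

```python
import random

def triplet_embed(n, triplets, dim=1, lr=0.05, epochs=200, margin=0.1):
    """Embed n items using only ordinal constraints: each triplet
    (a, b, c) asserts "a and b are farther apart than a and c".
    A hinge penalty on squared distances is reduced by pushing b away
    from a and pulling c toward a whenever a triplet is violated."""
    rng = random.Random(0)
    X = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(epochs):
        for a, b, c in triplets:
            dab = sum((X[a][k] - X[b][k]) ** 2 for k in range(dim))
            dac = sum((X[a][k] - X[c][k]) ** 2 for k in range(dim))
            if dab < dac + margin:  # violated: want dab >= dac + margin
                for k in range(dim):
                    X[b][k] -= lr * 2 * (X[a][k] - X[b][k])  # push b away from a
                    X[c][k] += lr * 2 * (X[a][k] - X[c][k])  # pull c toward a
    return X
```

Note that no distance values ever enter the optimization, only their ordering — the stability point the slide makes about ranks versus raw distances.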
Density-preserving Maps
- Preserving densities is statistically more meaningful than preserving distances
- Might allow more reliable conclusions from the low-D display about clustering and outliers
- DC formulation (Ozakin and Gray, to be submitted to AISTATS 2008)
Co-occurrence Embedding
- Consider InBio data: 3M occurrences of species in Costa Rica
- Densities are not reliable, as the sampling strategy is unknown
- But the overlap of two species' densities (co-occurrence) may be more reliable
- How can distribution distances be embedded? (Syed, Ozakin, Gray, to be submitted to ICML 2009)
Goal
- New displays for high-dimensional data
  - Isometric non-negative matrix factorization
  - Rank-based embedding
  - Density-preserving maps
  - Co-occurrence embedding
- New algorithms for scaling them to big datasets
  - Distances: Generalized Fast Multipole Method
  - Dot products: Cosine Trees and QUIC-SVD
  - MLPACK
Computational problem
- Such manifold methods are expensive: typically O(N³)
- But it is big datasets that are often the most important to visually summarize
- What are the underlying computations?
  - All-k-nearest-neighbors
  - Kernel summations
  - Eigendecomposition
  - Convex optimization
Computational problem
- Such manifold methods are expensive: typically O(N³)
- But it is big datasets that are often the most important to visually summarize
- What are the underlying computations?
  - All-k-nearest-neighbors (distances)
  - Kernel summations (distances)
  - Eigendecomposition (dot products)
  - Convex optimization (dot products)
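For concreteness, here is the naive O(N²) form of one of these bottlenecks, a Gaussian kernel summation, in plain Python (an illustrative sketch, not code from the talk); this double loop over all query-reference pairs is what the tree-based methods below accelerate:

```python
import math

def kernel_summations(queries, references, h):
    """Naive O(N^2) Gaussian kernel summation: for each query q,
    compute sum over references r of exp(-|q - r|^2 / (2 h^2)).
    Every query visits every reference point, hence the quadratic cost."""
    sums = []
    for q in queries:
        s = 0.0
        for r in references:
            d2 = sum((qi - ri) ** 2 for qi, ri in zip(q, r))
            s += math.exp(-d2 / (2.0 * h * h))
        sums.append(s)
    return sums
```

With a kernel sum per query point, kernel density estimation over N points costs N such sums, which is why O(N²) quickly becomes the limiting factor on big datasets.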
Distances: Generalized Fast Multipole Method
- Generalized N-body Problems (Gray and Moore, NIPS 2000; Riegel, Boyer, and Gray, TR 2008) include:
  - All-k-nearest-neighbors
  - Kernel summations
  - Force summations in physics
  - A very large number of bottleneck statistics and machine learning computations
- Defined using category theory
Distances: Generalized Fast Multipole Method
- There exists a generalization (Gray and Moore, NIPS 2000; Riegel, Boyer, and Gray, TR 2008) of the Fast Multipole Method (Greengard and Rokhlin 1987) which:
  - specializes to each of these problems
  - is the fastest practical algorithm for these problems
  - elucidates general principles for such problems
- Parallel version: THOR (Tree-based Higher-Order Reduce)
Distances: Generalized Fast Multipole Method
- Elements of the GFMM:
  - A spatial tree data structure, e.g. kd-trees, metric trees, cover trees, SVD trees, disk trees
  - A tree expansion pattern
  - Tree-stored cached statistics
  - An error criterion and pruning criterion
  - A local approximation/pruning scheme with error bounds, e.g. Hermite expansions, Monte Carlo, exact pruning
kd-trees: the most widely-used space-partitioning tree (Bentley 1975; Friedman, Bentley, and Finkel 1977; Moore and Lee 1995)
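A minimal kd-tree construction, in the Bentley 1975 style shown on the following slides, can be sketched in pure Python (function names and the dict-based node layout are illustrative choices, not from the talk):

```python
def bounding_box(points):
    """Axis-aligned bounding box of a point set: (mins, maxs)."""
    dims = range(len(points[0]))
    return ([min(p[d] for p in points) for d in dims],
            [max(p[d] for p in points) for d in dims])

def build_kdtree(points, depth=0, leaf_size=2):
    """Build a kd-tree by splitting at the median along a cycling
    coordinate. Every node stores its bounding box, which is what
    the pruning rules of tree-based algorithms test against."""
    if len(points) <= leaf_size:
        return {"points": points, "box": bounding_box(points)}
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "box": bounding_box(points),
        "left": build_kdtree(pts[:mid], depth + 1, leaf_size),
        "right": build_kdtree(pts[mid:], depth + 1, leaf_size),
    }
```

Each recursion level corresponds to one of the "kd-tree level" slides: the root box is split in two, then each child is split along the next coordinate, and so on down to small leaves.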
A kd-tree: levels 1 through 6 (build animation; each level splits the data along one coordinate)
Example: generalized histogram (range count with bandwidth h around query point q)
Range-count recursive algorithm (animation): the tree is traversed recursively; a node whose box lies entirely within the query range is pruned by inclusion (its full count is added without visiting its points), and a node entirely outside the range is pruned by exclusion (skipped). Fastest practical algorithm (Bentley 1975); our algorithms can use any tree.
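The inclusion/exclusion pruning animated above can be sketched in pure Python. This is a hedged sketch of the technique, not the talk's implementation; the node layout (dicts with a box, a count, and leaf points) and function names are illustrative.

```python
def build(points, depth=0):
    """Tiny kd-tree where every node caches its bounding box and
    point count (the "tree-stored cached statistics" of the GFMM)."""
    dims = range(len(points[0]))
    box = ([min(p[d] for p in points) for d in dims],
           [max(p[d] for p in points) for d in dims])
    node = {"box": box, "count": len(points)}
    if len(points) <= 2:
        node["points"] = points
        return node
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    node["left"] = build(pts[:mid], depth + 1)
    node["right"] = build(pts[mid:], depth + 1)
    return node

def range_count(node, q, h):
    """Count reference points within distance h of query q, pruning
    whole subtrees using bounds from the node's bounding box."""
    lo, hi = node["box"]
    # Squared min/max distance from q to any point in the box.
    dmin2 = sum(max(l - qi, 0.0, qi - u) ** 2 for qi, l, u in zip(q, lo, hi))
    dmax2 = sum(max(abs(qi - l), abs(qi - u)) ** 2 for qi, l, u in zip(q, lo, hi))
    if dmin2 > h * h:    # Pruned! (exclusion): box entirely outside range
        return 0
    if dmax2 <= h * h:   # Pruned! (inclusion): box entirely inside range
        return node["count"]
    if "points" in node:  # leaf: check the points directly
        return sum(sum((qi - pi) ** 2 for qi, pi in zip(q, p)) <= h * h
                   for p in node["points"])
    return range_count(node["left"], q, h) + range_count(node["right"], q, h)
```

The exact answer is returned, but large swaths of the tree are never descended — the same pattern, with approximations in place of exact pruning, drives the kernel summations and nearest-neighbor searches listed earlier.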
Dot products: Cosine Trees and QUIC-SVD
- QUIC-SVD (Holmes, Gray, Isbell, NIPS 2008)
- Cosine Trees: trees for dot products
- Use Monte Carlo within cosine trees to achieve best-rank approximation with user-specified relative error
- Very fast, but with probabilistic bounds
Dot products: Cosine Trees and QUIC-SVD
- Uses of QUIC-SVD:
  - PCA, KPCA, eigendecomposition
  - Working on fast interior-point convex optimization
Bigger goal: make all the best statistical/learning methods efficient!
- Ground rules:
  - Asymptotic speedup as well as practical speedup
  - Arbitrarily high accuracy, with error guarantees
  - No manual tweaking
  - Really works (validated in a big real-world problem)
- Treating entire classes of methods:
  - Methods based on distances (generalized N-body problems)
  - Methods based on dot products (linear algebra)
  - Soon: methods based on discrete structures (combinatorial/graph problems)
- Watch for MLPACK, coming Dec. 2008
  - Meant to be the equivalent of linear algebra's LAPACK
So far: fastest algorithms for
- 2000: all-nearest-neighbors (1970)
- 2000: n-point correlation functions (1950)
- 2003, 05, 06: kernel density estimation (1953)
- 2004: nearest-neighbor classification (1965)
- 2005, 06, 08: nonparametric Bayes classifier (1951)
- 2006: mean-shift clustering/tracking (1972)
- 2006: k-means clustering (1960s)
- 2007: hierarchical clustering/EMST (1960s)
- 2007: affinity propagation/clustering (2007)
So far: fastest algorithms for
- 2008: principal component analysis (1930s)
- 2008: local linear kernel regression (1960s)
- 2008: hidden Markov models (1970s)
- Working on:
  - linear regression, Kalman filters (1960s)
  - Gaussian process regression (1960s)
  - Gaussian graphical models (1970s)
  - manifolds, spectral clustering (2000s)
  - convex kernel machines (2000s)
Some application highlights so far
- First large-scale dark energy confirmation (top Science breakthrough of 2003)
- First large-scale cosmic magnification confirmation (Nature, 2005)
- Integration into Google image search (we think), 2005
- Integration into Microsoft SQL Server, 2008
- Working on:
  - Integration into Large Hadron Collider pipeline, 2008
  - Fast IP-based spam filtering (Secure Comp, 2008)
  - Fast recommendation (Netflix)
To find out more
- Best way: agray_at_cc.gatech.edu
- Mostly outdated: www.cc.gatech.edu/agray