Title: Fast Algorithms for Analyzing Massive Data
Slide 1: Fast Algorithms for Analyzing Massive Data
- Alexander Gray
- Georgia Institute of Technology
- www.fast-lab.org
Slide 2: The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory (www.fast-lab.org)
- Alexander Gray: Assoc. Prof., Applied Math & CS; PhD, CS
- Arkadas Ozakin: Research Scientist, Math & Physics; PhD, Physics
- Dongryeol Lee: PhD student, CS & Math
- Ryan Riegel: PhD student, CS & Math
- Sooraj Bhat: PhD student, CS
- Nishant Mehta: PhD student, CS
- Parikshit Ram: PhD student, CS & Math
- William March: PhD student, Math & CS
- Hua Ouyang: PhD student, CS
- Ravi Sastry: PhD student, CS
- Long Tran: PhD student, CS
- Ryan Curtin: PhD student, EE
- Ailar Javadi: PhD student, EE
- Anita Zakrzewska: PhD student, CS
- 5-10 MS students and undergraduates
Slide 3: 7 tasks of machine learning / data mining
- Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
- Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3)
- Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM
- Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)
- Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3), Gaussian graphical models, discrete graphical models
- Clustering: k-means, mean-shift O(N^2), hierarchical (friends-of-friends) clustering O(N^3)
- Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding
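To make the quadratic costs in the list above concrete, here is a minimal sketch of the naive all-nearest-neighbors baseline that the O(N^2) entry refers to: every point is compared against every other point. This is an illustrative toy (names are made up), not the lab's tree-based algorithm.

```python
import numpy as np

def all_nearest_neighbors_naive(X):
    # For each point, compute the distance to every other point and
    # keep the closest: Theta(N^2) distance evaluations overall.
    N = len(X)
    nn = np.empty(N, dtype=int)
    for i in range(N):
        d = np.sum((X - X[i]) ** 2, axis=1)  # squared distances to all N points
        d[i] = np.inf                        # a point is not its own neighbor
        nn[i] = np.argmin(d)
    return nn

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
nn = all_nearest_neighbors_naive(X)
```

Each of the N queries does O(N) work, which is exactly the scaling that the tree-based methods on the later slides reduce.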
Slide 5: 7 tasks of machine learning / data mining (with recent FASTlab methods)
- Querying: spherical range-search O(N), orthogonal range-search O(N), nearest-neighbor O(N), all-nearest-neighbors O(N^2)
- Density estimation: mixture of Gaussians, kernel density estimation O(N^2), kernel conditional density estimation O(N^3), submanifold density estimation [Ozakin & Gray, NIPS 2010] O(N^3), convex adaptive kernel estimation [Sastry & Gray, AISTATS 2011] O(N^4)
- Classification: decision tree, nearest-neighbor classifier O(N^2), kernel discriminant analysis O(N^2), support vector machine O(N^3), Lp SVM, non-negative SVM [Guan et al., 2011]
- Regression: linear regression, LASSO, kernel regression O(N^2), Gaussian process regression O(N^3)
- Dimension reduction: PCA, non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3), Gaussian graphical models, discrete graphical models, rank-preserving maps [Ouyang & Gray, ICML 2008] O(N^3), isometric separation maps [Vasiloglou, Gray, & Anderson, MLSP 2009] O(N^3), isometric NMF [Vasiloglou, Gray, & Anderson, MLSP 2009] O(N^3), functional ICA [Mehta & Gray, 2009], density preserving maps [Ozakin & Gray, in prep] O(N^3)
- Clustering: k-means, mean-shift O(N^2), hierarchical (friends-of-friends) clustering O(N^3)
- Testing and matching: MST O(N^3), bipartite cross-matching O(N^3), n-point correlation 2-sample testing O(N^n), kernel embedding
Computational Problem!
Slide 7: The 7 Giants of Data (computational problem types)
[Gray, Indyk, Mahoney, Szalay, in National Acad. of Sci. Report on Analysis of Massive Data, in prep]
- Basic statistics: means, covariances, etc.
- Generalized N-body problems: distances, geometry
- Graph-theoretic problems: discrete graphs
- Linear-algebraic problems: matrix operations
- Optimizations: unconstrained, convex
- Integrations: general dimension
- Alignment problems: dynamic programming, matching
Slide 8: 7 general strategies
- Divide and conquer / indexing (trees)
- Function transforms (series)
- Sampling (Monte Carlo, active learning)
- Locality (caching)
- Streaming (online)
- Parallelism (clusters, GPUs)
- Problem transformation (reformulations)
Slide 9: 1. Divide and conquer
- Fastest approach for:
  - nearest neighbor, range search (exact) O(log N) [Bentley, 1970]; all-nearest-neighbors (exact) O(N) [Gray & Moore, NIPS 2000; Ram, Lee, March, & Gray, NIPS 2010]; anytime nearest neighbor (exact) [Ram & Gray, SDM 2012]; max inner product [Ram & Gray, under review]
  - mixture of Gaussians [Moore, NIPS 1999]; k-means [Pelleg & Moore, KDD 1999]; mean-shift clustering O(N) [Lee & Gray, AISTATS 2009]; hierarchical clustering (single linkage, friends-of-friends) O(N log N) [March & Gray, KDD 2010]
  - nearest neighbor classification [Liu, Moore, & Gray, NIPS 2004]; kernel discriminant analysis O(N) [Riegel & Gray, SDM 2008]
  - n-point correlation functions O(N^(log n)) [Gray & Moore, NIPS 2000; Moore et al., Mining the Sky 2000]; multi-matcher jackknifed npcf [March & Gray, under review]
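The tree-based divide and conquer behind these results can be illustrated with a textbook Bentley-style kd-tree for exact nearest-neighbor search. This is a minimal teaching sketch (all function names are invented here), not the FASTlab implementation: descend toward the query's side of each split, and visit the far side only when the splitting plane could hide a closer point.

```python
import math, random

def build_kdtree(points, depth=0):
    # Leaves hold at most one point; internal nodes split on one axis.
    if len(points) <= 1:
        return {"leaf": points}
    axis = depth % len(points[0])                  # cycle through dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"axis": axis, "split": points[mid][axis],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid:], depth + 1)}

def nn_search(node, q, best=None):
    # Visit the near child first; prune the far child whenever the
    # splitting plane is farther away than the best distance so far.
    if "leaf" in node:
        for p in node["leaf"]:
            d = math.dist(p, q)
            if best is None or d < best[0]:
                best = (d, p)
        return best
    near, far = ((node["left"], node["right"])
                 if q[node["axis"]] < node["split"]
                 else (node["right"], node["left"]))
    best = nn_search(near, q, best)
    if best is None or abs(q[node["axis"]] - node["split"]) < best[0]:
        best = nn_search(far, q, best)
    return best

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(500)]
tree = build_kdtree(pts)
q = (0.5, 0.5)
d_tree, p_tree = nn_search(tree, q)
```

The pruning test is what turns an O(N) scan into roughly O(log N) per query on low-dimensional data, while still returning the exact answer.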
Slide 10: 3-point correlation
VIRGO simulation data, N = 75,000,000 (biggest previous: 20K)
- naive: 5x10^9 sec. (150 years)
- multi-tree: 55 sec. (exact)
Observed scaling: n=2: O(N); n=3: O(N^(log 3)); n=4: O(N^2)
Slide 11: 3-point correlation (10^6 points, galaxy simulation data)
Naive O(N^n) (estimated) vs. single-bandwidth [Gray & Moore 2000; Moore et al. 2000] vs. new multi-bandwidth [March & Gray, in prep 2010]; each speedup is relative to the previous column.
- 2-point corr., 100 matchers: naive 2.0x10^7 s; single 352.8 s (56,000x); multi 4.96 s (71.1x)
- 3-point corr., 243 matchers: naive 1.1x10^11 s; single 891.6 s (1.23x10^8 x); multi 13.58 s (65.6x)
- 4-point corr., 216 matchers: naive 2.3x10^14 s; single 14530 s (1.58x10^10 x); multi 503.6 s (28.8x)
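For reference, the simplest of these statistics, a single-matcher 2-point count, is just a double loop over pairs. This naive O(N^2) baseline (helper names are illustrative) is what the single- and multi-tree algorithms in the timings above replace:

```python
import math, random

def two_point_count(points, r_lo, r_hi):
    # Count unordered pairs whose separation falls inside the matcher
    # bin [r_lo, r_hi): N(N-1)/2 distance evaluations.
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if r_lo <= math.dist(points[i], points[j]) < r_hi:
                count += 1
    return count

random.seed(0)
pts = [(random.random(), random.random(), random.random())
       for _ in range(300)]
c = two_point_count(pts, 0.1, 0.2)
```

Higher-order n-point counts nest further loops, giving the O(N^n) naive cost in the table; tree algorithms prune whole blocks of pairs (or triples) whose bounding boxes cannot satisfy the matcher.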
Slide 12: 2. Function transforms
- Fastest approach for:
  - kernel estimation (low-ish dimension): dual-tree fast Gauss transforms (multipole/Hermite expansions) [Lee, Gray, & Moore, NIPS 2005; Lee & Gray, UAI 2006]
  - KDE and GP (kernel density estimation, Gaussian process regression) (high-D): random Fourier functions [Lee & Gray, in prep]
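The random-Fourier-functions idea can be sketched in the style of Rahimi and Recht's random features (this is the generic construction, not the cited Lee & Gray algorithm, and all names below are illustrative): draw frequencies from the Gaussian kernel's spectral density so that an inner product of finite feature maps approximates the kernel.

```python
import numpy as np

def random_fourier_features(X, D, gamma, rng):
    # For the kernel exp(-gamma * ||x - y||^2), sample frequencies
    # w ~ N(0, 2*gamma*I) plus uniform phases; then z(x).z(y) is an
    # unbiased Monte Carlo estimate of the kernel value.
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Z = random_fourier_features(X, 4000, gamma=0.5, rng=rng)
K_approx = Z @ Z.T                       # D-dimensional approximation
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * sq)              # exact Gaussian kernel matrix
err = float(np.abs(K_approx - K_exact).max())
```

The payoff is computational: kernel sums become ordinary D-dimensional linear algebra, with error shrinking like 1/sqrt(D) independent of the input dimension, which is why this route suits high-D KDE and GP regression.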
Slide 13: 3. Sampling
- Fastest approach for (approximate):
  - PCA: cosine trees [Holmes, Gray, & Isbell, NIPS 2008]
  - kernel estimation: bandwidth learning [Holmes, Gray, & Isbell, NIPS 2006; Holmes, Gray, & Isbell, UAI 2007]; Monte Carlo multipole method (with SVD trees) [Lee & Gray, NIPS 2009]
  - nearest-neighbor: distance-approximate spill trees with random projections [Liu, Moore, Gray, & Yang, NIPS 2004]; rank-approximate [Ram, Ouyang, & Gray, NIPS 2009]
- Rank-approximate NN:
  - best meaning-retaining approximation criterion in the face of high-dimensional distances
  - more accurate than LSH
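In the same spirit, though far simpler than the cited methods, a kernel sum can be estimated from a random subsample, trading a controlled 1/sqrt(m) error for a large speedup. A minimal sketch (names are hypothetical, 1-D for brevity):

```python
import math, random

def kde_at(q, points, h):
    # Exact Gaussian KDE value at query q: an O(N) sum per query.
    return sum(math.exp(-((q - p) ** 2) / (2 * h * h)) for p in points) / len(points)

def kde_at_sampled(q, points, h, m, rng):
    # Unbiased Monte Carlo estimate from m sampled points; the standard
    # error shrinks as 1/sqrt(m), independent of N.
    sample = [points[rng.randrange(len(points))] for _ in range(m)]
    return sum(math.exp(-((q - p) ** 2) / (2 * h * h)) for p in sample) / m

rng = random.Random(0)
points = [rng.gauss(0.0, 1.0) for _ in range(20000)]
exact = kde_at(0.0, points, h=0.5)
approx = kde_at_sampled(0.0, points, h=0.5, m=2000, rng=rng)
```

The cited work goes further by combining such sampling with tree and SVD structure so the error can be bounded adaptively rather than on average.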
Slide 14: 3. Sampling (cont.)
- Active learning: the sampling can depend on previous samples
- Linear classifiers: rigorous framework for pool-based active learning [Sastry & Gray, AISTATS 2012]
- Empirically allows a reduction in the number of objects that require labeling
- Theoretical rigor: unbiasedness
Slide 15: 4. Caching
- Fastest approach for (using disk):
  - nearest-neighbor, 2-point: disk-based tree algorithms in Microsoft SQL Server [Riegel, Aditya, Budavari, & Gray, in prep]
- Builds a kd-tree on top of the built-in B-trees
- Fixed-pass algorithm to build the kd-tree
No. of points   MLDB (dual-tree)   Naive
40,000          8 sec              159 sec
200,000         43 sec             3480 sec
2,000,000       297 sec            80 hours
10,000,000      29 min 27 sec      74 days
20,000,000      58 min 48 sec      280 days
40,000,000      112 min 32 sec     2 years
Slide 16: 5. Streaming / online
- Fastest approach for (approximate, or streaming):
  - online learning / stochastic optimization: just use the current sample to update the gradient
  - SVM (squared hinge loss): stochastic Frank-Wolfe [Ouyang & Gray, SDM 2010]
  - SVM, LASSO, et al.: noise-adaptive stochastic approximation [Ouyang & Gray, in prep, on arXiv]; accelerated non-smooth SGD [Ouyang & Gray, under review]
    - faster than SGD
    - solves the step-size problem
    - beats all existing convergence rates
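The "just use the current sample" update is easiest to see on a toy model. A minimal SGD sketch for 1-D least squares (illustrative only, not the cited algorithms; names are made up):

```python
import random

def sgd_linear(data, lr=0.1, epochs=30, seed=0):
    # Plain SGD for y ~ w*x under squared loss: each update touches only
    # the current sample, so one pass over the stream is O(N).
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            grad = 2 * (w * x - y) * x   # gradient of (w*x - y)^2 in w
            w -= lr * grad
    return w

data = [(x / 10, 3.0 * (x / 10)) for x in range(1, 21)]  # noiseless y = 3x
w = sgd_linear(data)
```

The step-size problem mentioned above is visible even here: too large an `lr` diverges, too small crawls; the cited noise-adaptive methods remove that tuning burden with provable rates.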
Slide 17: 6. Parallelism
- Fastest approach for (using many machines):
  - KDE, GP, n-point: distributed trees [Lee & Gray, SDM 2012], 6000 cores; [March et al., in prep for Gordon Bell Prize 2012], 100K cores?
    - each process owns the global tree and its local tree
    - first log p levels built in parallel; each process determines where to send data
    - asynchronous averaging; provable convergence
  - SVM, LASSO, et al.: distributed online optimization [Ouyang & Gray, in prep, on arXiv]
    - provable theoretical speedup for the first time
Slide 18: 7. Transformations between problems
- Change the problem type:
  - linear algebra on kernel matrices -> N-body inside conjugate gradient [Gray, TR 2004]
  - Euclidean graphs -> N-body problems [March & Gray, KDD 2010]
  - HMM as graph -> matrix factorization [Tran & Gray, in prep]
- Optimizations: reformulate the objective and constraints:
  - maximum variance unfolding: SDP via Burer-Monteiro convex relaxation [Vasiloglou, Gray, & Anderson, MLSP 2009]
  - Lq SVM, 0 < q < 1: DC programming [Guan & Gray, CSDA 2011]
  - L0 SVM: mixed integer nonlinear program via perspective cuts [Guan & Gray, under review]
  - do reformulations automatically [Agarwal et al., PADL 2010; Bhat et al., POPL 2012]
- Create new ML methods with desired computational properties:
  - density estimation trees: nonparametric density estimation, O(N log N) [Ram & Gray, KDD 2011]
  - local linear SVMs: nonlinear classification, O(N log N) [Sastry & Gray, under review]
  - discriminative local coding: nonlinear classification, O(N log N) [Mehta & Gray, under review]
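The first transformation above only requires writing conjugate gradient against a matrix-vector product callback: the kernel matrix is never formed, and the matvec is where a fast N-body-style summation plugs in. A generic sketch (standard CG; the dense 2x2 matvec below stands in for the fast summation, and all names are illustrative):

```python
def conjugate_gradient(matvec, b, tol=1e-12, max_iter=100):
    # Standard CG for a symmetric positive definite system A x = b.
    # A is touched only through matvec(p), so any fast summation that
    # computes A*p (e.g. a tree-based kernel sum) can be substituted.
    x = [0.0] * len(b)
    r = list(b)                       # residual for the zero initial guess
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Toy s.p.d. system: A = [[4, 1], [1, 3]], b = [1, 2].
A = [[4.0, 1.0], [1.0, 3.0]]
matvec = lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
x = conjugate_gradient(matvec, [1.0, 2.0])
```

With a kernel matrix, each iteration's matvec is exactly a weighted kernel summation, i.e. a generalized N-body problem, so an O(N)-ish summation turns an O(N^3) solve into a handful of fast passes.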
Slide 19: Software
- For academic use only: MLPACK
  - open source, C++, written by students
  - data must fit in RAM (distributed version in progress)
- For institutions: Skytree Server
  - first commercial-grade high-performance machine learning server
  - fastest, biggest ML available: up to 10,000x faster than existing solutions (on one machine)
  - v.12, April 2012-ish: distributed, streaming
  - connects to stats packages, Matlab, DBMS, Python, etc.
  - www.skytreecorp.com
- Colleagues: email me to try it out: agray@cc.gatech.edu