1
Fast Algorithms for Analyzing Massive Data
  • Alexander Gray
  • Georgia Institute of Technology
  • www.fast-lab.org

2
The FASTlab
Fundamental Algorithmic and Statistical Tools Laboratory
www.fast-lab.org
  • Alexander Gray: Assoc Prof, Applied Math & CS;
    PhD CS
  • Arkadas Ozakin: Research Scientist, Math &
    Physics; PhD Physics
  • Dongryeol Lee: PhD student, CS & Math
  • Ryan Riegel: PhD student, CS & Math
  • Sooraj Bhat: PhD student, CS
  • Nishant Mehta: PhD student, CS
  • Parikshit Ram: PhD student, CS & Math
  • William March: PhD student, Math & CS
  • Hua Ouyang: PhD student, CS
  • Ravi Sastry: PhD student, CS
  • Long Tran: PhD student, CS
  • Ryan Curtin: PhD student, EE
  • Ailar Javadi: PhD student, EE
  • Anita Zakrzewska: PhD student, CS
  • 5-10 MS students and undergraduates

3
7 tasks of machine learning / data mining
  • Querying: spherical range-search O(N), orthogonal
    range-search O(N), nearest-neighbor O(N),
    all-nearest-neighbors O(N^2)
  • Density estimation: mixture of Gaussians, kernel
    density estimation O(N^2), kernel conditional
    density estimation O(N^3)
  • Classification: decision tree, nearest-neighbor
    classifier O(N^2), kernel discriminant analysis
    O(N^2), support vector machine O(N^3), Lp SVM
  • Regression: linear regression, LASSO, kernel
    regression O(N^2), Gaussian process regression
    O(N^3)
  • Dimension reduction: PCA, non-negative matrix
    factorization, kernel PCA O(N^3), maximum variance
    unfolding O(N^3), Gaussian graphical models,
    discrete graphical models
  • Clustering: k-means, mean-shift O(N^2),
    hierarchical (FoF) clustering O(N^3)
  • Testing and matching: MST O(N^3), bipartite
    cross-matching O(N^3), n-point correlation
    2-sample testing O(N^n), kernel embedding

5
7 tasks of machine learning / data mining
  • Querying: spherical range-search O(N), orthogonal
    range-search O(N), nearest-neighbor O(N),
    all-nearest-neighbors O(N^2)
  • Density estimation: mixture of Gaussians, kernel
    density estimation O(N^2), kernel conditional
    density estimation O(N^3), submanifold density
    estimation [Ozakin & Gray, NIPS 2010] O(N^3),
    convex adaptive kernel estimation [Sastry & Gray,
    AISTATS 2011] O(N^4)
  • Classification: decision tree, nearest-neighbor
    classifier O(N^2), kernel discriminant analysis
    O(N^2), support vector machine O(N^3), Lp SVM,
    non-negative SVM [Guan et al, 2011]
  • Regression: linear regression, LASSO, kernel
    regression O(N^2), Gaussian process regression
    O(N^3)
  • Dimension reduction: PCA, non-negative matrix
    factorization, kernel PCA O(N^3), maximum variance
    unfolding O(N^3), Gaussian graphical models,
    discrete graphical models, rank-preserving maps
    [Ouyang and Gray, ICML 2008] O(N^3), isometric
    separation maps [Vasiloglou, Gray, and Anderson,
    MLSP 2009] O(N^3), isometric NMF [Vasiloglou,
    Gray, and Anderson, MLSP 2009] O(N^3), functional
    ICA [Mehta and Gray, 2009], density preserving
    maps [Ozakin and Gray, in prep] O(N^3)
  • Clustering: k-means, mean-shift O(N^2),
    hierarchical (FoF) clustering O(N^3)
  • Testing and matching: MST O(N^3), bipartite
    cross-matching O(N^3), n-point correlation
    2-sample testing O(N^n), kernel embedding

6
7 tasks of machine learning / data mining
  • Querying: spherical range-search O(N), orthogonal
    range-search O(N), nearest-neighbor O(N),
    all-nearest-neighbors O(N^2)
  • Density estimation: mixture of Gaussians, kernel
    density estimation O(N^2), kernel conditional
    density estimation O(N^3)
  • Classification: decision tree, nearest-neighbor
    classifier O(N^2), kernel discriminant analysis
    O(N^2), support vector machine O(N^3), Lp SVM
  • Regression: linear regression, kernel regression
    O(N^2), Gaussian process regression O(N^3), LASSO
  • Dimension reduction: PCA, non-negative matrix
    factorization, kernel PCA O(N^3), maximum variance
    unfolding O(N^3), Gaussian graphical models,
    discrete graphical models
  • Clustering: k-means, mean-shift O(N^2),
    hierarchical (FoF) clustering O(N^3)
  • Testing and matching: MST O(N^3), bipartite
    cross-matching O(N^3), n-point correlation
    2-sample testing O(N^n), kernel embedding

Computational Problem!
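
To make the quadratic costs in the task list concrete, here is a minimal sketch of naive kernel density estimation, evaluated at every data point. It is not taken from the slides: the function name `naive_kde`, the 1-D setting, and the Gaussian kernel are assumptions chosen for illustration. The counter shows that N points cost N * N kernel evaluations.

```python
import math


def naive_kde(points, bandwidth):
    """Gaussian KDE at every data point (1-D, illustrative only).

    Returns the density estimates and the number of kernel evaluations,
    which is always len(points) ** 2 for this naive double loop.
    """
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    evaluations = 0
    for x in points:
        total = 0.0
        for y in points:
            u = (x - y) / bandwidth
            total += math.exp(-0.5 * u * u)
            evaluations += 1
        densities.append(norm * total)
    return densities, evaluations
```

Doubling N quadruples `evaluations`, which is the scaling that the strategies in the rest of the talk aim to avoid.
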
7
The 7 Giants of Data (computational problem types)
[Gray, Indyk, Mahoney, Szalay, in National Acad of
Sci Report on Analysis of Massive Data, in prep]
  • Basic statistics: means, covariances, etc.
  • Generalized N-body problems: distances, geometry
  • Graph-theoretic problems: discrete graphs
  • Linear-algebraic problems: matrix operations
  • Optimizations: unconstrained, convex
  • Integrations: general dimension
  • Alignment problems: dynamic prog, matching

8
7 general strategies
  • Divide and conquer / indexing (trees)
  • Function transforms (series)
  • Sampling (Monte Carlo, active learning)
  • Locality (caching)
  • Streaming (online)
  • Parallelism (clusters, GPUs)
  • Problem transformation (reformulations)

9
1. Divide and conquer
  • Fastest approach for:
  • nearest neighbor, range search (exact) O(log N)
    [Bentley 1970], all-nearest-neighbors (exact)
    O(N) [Gray & Moore, NIPS 2000], [Ram, Lee, March,
    Gray, NIPS 2010], anytime nearest neighbor
    (exact) [Ram & Gray, SDM 2012], max inner product
    [Ram & Gray, under review]
  • mixture of Gaussians [Moore, NIPS 1999], k-means
    [Pelleg and Moore, KDD 1999], mean-shift
    clustering O(N) [Lee & Gray, AISTATS 2009],
    hierarchical clustering (single linkage,
    friends-of-friends) O(N log N) [March & Gray, KDD
    2010]
  • nearest neighbor classification [Liu, Moore,
    Gray, NIPS 2004], kernel discriminant analysis
    O(N) [Riegel & Gray, SDM 2008]
  • n-point correlation functions O(N^(log n)) [Gray
    & Moore, NIPS 2000], [Moore et al., Mining the
    Sky 2000], multi-matcher jackknifed npcf [March
    & Gray, under review]
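
As an illustration of the divide-and-conquer idea behind tree-based nearest-neighbor search, the following is a minimal single-tree kd-tree sketch. It is not the dual-tree algorithm from the cited papers; the dictionary-based node layout and the names `build_kdtree`, `sq_dist`, and `nearest` are assumptions made for this example. The pruning test skips a subtree whenever the splitting plane is farther away than the best distance found so far.

```python
def build_kdtree(points, depth=0):
    """Recursively split points at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared distance, point) of the exact nearest neighbor."""
    if node is None:
        return best
    d = sq_dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best
```

A call such as `nearest(build_kdtree(points), query)` returns the same answer as a brute-force scan, while typically visiting only a small part of the tree in low dimensions.
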

10
3-point correlation
(biggest previous: 20K)
VIRGO simulation data, N = 75,000,000
naïve: 5 x 10^9 sec. (~150 years)
multi-tree: 55 sec. (exact)
n=2: O(N), n=3: O(N^(log 3)), n=4: O(N^2)
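
For reference, the naive baseline that these timings are compared against can be sketched as a brute-force count over all triples. This is an illustrative assumption, not the authors' code: the function name `naive_three_point_count` is hypothetical, and the matcher is simplified to a single distance interval applied to all three sides of the triangle.

```python
import math
from itertools import combinations


def naive_three_point_count(points, r_lo, r_hi):
    """Count triples whose three pairwise distances all lie in [r_lo, r_hi].

    Enumerates every unordered triple, so the cost grows as O(N^3).
    """
    count = 0
    for a, b, c in combinations(points, 3):
        sides = (math.dist(a, b), math.dist(b, c), math.dist(a, c))
        if all(r_lo <= s <= r_hi for s in sides):
            count += 1
    return count
```

With N = 75,000,000 the number of triples is on the order of 10^22, which is why the naive running time above can only be estimated.
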
11
3-point correlation
10^6 points, galaxy simulation data
Columns: naive O(N^n) time (estimated); single-bandwidth time [Gray & Moore 2000, Moore et al. 2000] and its speedup over naive; multi-bandwidth time [March & Gray, in prep 2010] (new) and its speedup over single-bandwidth.

                              Naive           Single bandwidth          Multi-bandwidth
                              time            time      speedup         time      speedup
2-point cor., 100 matchers    2.0 x 10^7 s    352.8 s   56,000          4.96 s    71.1
3-point cor., 243 matchers    1.1 x 10^11 s   891.6 s   1.23 x 10^8     13.58 s   65.6
4-point cor., 216 matchers    2.3 x 10^14 s   14530 s   1.58 x 10^10    503.6 s   28.8

12
2. Function transforms
  • Fastest approach for:
  • Kernel estimation (low-ish dimension): dual-tree
    fast Gauss transforms (multipole/Hermite
    expansions) [Lee, Gray, Moore, NIPS 2005], [Lee
    and Gray, UAI 2006]
  • KDE and GP (kernel density estimation, Gaussian
    process regression) (high-D): random Fourier
    functions [Lee and Gray, in prep]
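
One well-known way to replace a kernel by a series-like expansion in high dimension is the random Fourier feature construction of Rahimi and Recht. The sketch below uses that standard construction as a stand-in; it is not the unpublished Lee and Gray method, and all names (`gaussian_kernel`, `make_random_fourier_features`, `approx_kernel`) are assumptions. The inner product of two feature vectors approximates the Gaussian kernel, so kernel sums become sums of short feature vectors.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 h^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def make_random_fourier_features(dim, num_features, bandwidth, seed=0):
    """Return a feature map z with z(x) . z(y) ~ gaussian_kernel(x, y)."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's Fourier transform: N(0, 1/h^2).
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return features


def approx_kernel(features, x, y):
    """Approximate kernel value as an inner product of random features."""
    return sum(a * b for a, b in zip(features(x), features(y)))
```

The approximation error shrinks roughly as one over the square root of `num_features`, independent of the data dimension.
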

13
3. Sampling
  • Fastest approach for (approximate):
  • PCA: cosine trees [Holmes, Gray, Isbell, NIPS
    2008]
  • Kernel estimation: bandwidth learning [Holmes,
    Gray, Isbell, NIPS 2006], [Holmes, Gray, Isbell,
    UAI 2007], Monte Carlo multipole method (with SVD
    trees) [Lee & Gray, NIPS 2009]
  • Nearest-neighbor: distance-approx: spill trees
    with random proj [Liu, Moore, Gray, Yang, NIPS
    2004], rank-approximate [Ram, Ouyang, Gray, NIPS
    2009]
  • Rank-approximate NN:
  • Best meaning-retaining approximation criterion in
    the face of high-dimensional distances
  • More accurate than LSH
14
3. Sampling
  • Active learning: the sampling can depend on
    previous samples
  • Linear classifiers: rigorous framework for
    pool-based active learning [Sastry and Gray,
    AISTATS 2012]
  • Empirically allows reduction in the number of
    objects that require labeling
  • Theoretical rigor: unbiasedness
15
4. Caching
  • Fastest approach for (using disk):
  • Nearest-neighbor, 2-point: disk-based tree
    algorithms in Microsoft SQL Server [Riegel,
    Aditya, Budavari, Gray, in prep]
  • Builds kd-tree on top of built-in B-trees
  • Fixed-pass algorithm to build kd-tree

No. of points   MLDB (Dual tree)   Naive
40,000          8 seconds          159 seconds
200,000         43 seconds         3480 seconds
2,000,000       297 seconds        80 hours
10,000,000      29 mins 27 sec     74 days
20,000,000      58 mins 48 sec     280 days
40,000,000      112 mins 32 sec    2 years

16
5. Streaming / online
  • Fastest approach for (approximate, or streaming):
  • Online learning/stochastic optimization: just use
    the current sample to update the gradient
  • SVM (squared hinge loss): stochastic Frank-Wolfe
    [Ouyang and Gray, SDM 2010]
  • SVM, LASSO, et al.: noise-adaptive stochastic
    approximation [Ouyang and Gray, in prep, on
    arxiv], accelerated non-smooth SGD [Ouyang and
    Gray, under review]
  • faster than SGD
  • solves step size problem
  • beats all existing convergence rates
17
6. Parallelism
  • Fastest approach for (using many machines):
  • KDE, GP, n-point: distributed trees [Lee and
    Gray, SDM 2012], 6000 cores; [March et al, in
    prep for Gordon Bell Prize 2012], 100K cores?
  • Each process owns the global tree and its local
    tree
  • First log p levels built in parallel; each
    process determines where to send data
  • Asynchronous averaging: provable convergence
  • SVM, LASSO, et al.: distributed online
    optimization [Ouyang and Gray, in prep, on arxiv]
  • Provable theoretical speedup for the first time
18
7. Transformations between problems
  • Change the problem type:
  • Linear algebra on kernel matrices → N-body inside
    conjugate gradient [Gray, TR 2004]
  • Euclidean graphs → N-body problems [March & Gray,
    KDD 2010]
  • HMM as graph → matrix factorization [Tran & Gray,
    in prep]
  • Optimizations: reformulate the objective and
    constraints
  • Maximum variance unfolding: SDP via
    Burer-Monteiro convex relaxation [Vasiloglou,
    Gray, Anderson, MLSP 2009]
  • Lq SVM, 0<q<1: DC programming [Guan & Gray, CSDA
    2011]
  • L0 SVM: mixed integer nonlinear program via
    perspective cuts [Guan & Gray, under review]
  • Do reformulations automatically [Agarwal et al,
    PADL 2010], [Bhat et al, POPL 2012]
  • Create new ML methods with desired computational
    properties:
  • Density estimation trees: nonparametric density
    estimation, O(N log N) [Ram & Gray, KDD 2011]
  • Local linear SVMs: nonlinear classification,
    O(N log N) [Sastry & Gray, under review]
  • Discriminative local coding: nonlinear
    classification, O(N log N) [Mehta & Gray, under
    review]
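
The first transformation listed above, linear algebra on kernel matrices turned into N-body computations inside conjugate gradient, can be sketched as follows. Conjugate gradient only needs matrix-vector products, and each entry of a kernel matrix-vector product is a weighted kernel summation, which is the kind of sum a fast N-body method could approximate. In this sketch the products are computed exactly by brute force; the function names, the 1-D Gaussian kernel, and the ridge term are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_kernel_matvec(points, bandwidth, ridge):
    """Return v -> (K + ridge * I) v without ever storing the matrix K."""
    def matvec(v):
        out = []
        for i, x_i in enumerate(points):
            # One weighted kernel summation per output entry (an N-body sum).
            s = sum(math.exp(-((x_i - x_j) / bandwidth) ** 2 / 2.0) * v_j
                    for x_j, v_j in zip(points, v))
            out.append(s + ridge * v[i])
        return out
    return matvec


def conjugate_gradient(matvec, b, tol=1e-18, max_iter=1000):
    """Solve A x = b for symmetric positive definite A given only matvec."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:  # squared residual norm is small enough
            break
        ap = matvec(p)
        alpha = rs / dot(p, ap)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * ap_i for r_i, ap_i in zip(r, ap)]
        rs_new = dot(r, r)
        p = [r_i + (rs_new / rs) * p_i for r_i, p_i in zip(r, p)]
        rs = rs_new
    return x
```

Swapping the brute-force `matvec` for an approximate tree-based summation would leave `conjugate_gradient` unchanged, which is the point of the reformulation.
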

19
Software
  • For academic use only: MLPACK
  • Open source, C++, written by students
  • Data must fit in RAM; distributed in progress
  • For institutions: Skytree Server
  • First commercial-grade high-performance machine
    learning server
  • Fastest, biggest ML available: up to 10,000x
    faster than existing solutions (on one machine)
  • V.12, April 2012-ish: distributed, streaming
  • Connects to stats packages, Matlab, DBMS, Python,
    etc
  • www.skytreecorp.com
  • Colleagues: email me to try it out,
    agray_at_cc.gatech.edu