Fast Algorithms (PowerPoint presentation transcript)
Learn more at: http://fodava.gatech.edu

Transcript and Presenter's Notes
1
Fast Algorithms & Data Structures for
Visualization and Machine Learning on Massive
Data Sets
  • Alexander Gray
  • Fundamental Algorithmic and Statistical Tools
    Laboratory (FASTlab)
  • Computational Science and Engineering Division
  • College of Computing
  • Georgia Institute of Technology

2
The FASTlab
  • Arkadas Ozakin: Research scientist, PhD
    theoretical physics
  • Dong Ryeol Lee: PhD student, CS & Math
  • Ryan Riegel: PhD student, CS & Math
  • Parikshit Ram: PhD student, CS & Math
  • William March: PhD student, Math & CS
  • James Waters: PhD student, Physics & CS
  • Nadeem Syed: PhD student, CS
  • Hua Ouyang: PhD student, CS
  • Sooraj Bhat: PhD student, CS
  • Ravi Sastry: PhD student, CS
  • Long Tran: PhD student, CS
  • Michael Holmes: PhD student, CS & Physics
    (co-supervised)
  • Nikolaos Vasiloglou: PhD student, EE
    (co-supervised)
  • Wei Guan: PhD student, CS (co-supervised)
  • Nishant Mehta: PhD student, CS (co-supervised)
  • Wee Chin Wong: PhD student, ChemE (co-supervised)
  • Abhimanyu Aditya: MS student, CS
  • Yatin Kanetkar: MS student, CS

3
Goal
  • New displays for high-dimensional data
  • Isometric non-negative matrix factorization
  • Rank-based embedding
  • Density-preserving maps
  • Co-occurrence embedding
  • New algorithms for scaling them to big datasets
  • Distances: Generalized Fast Multipole Method
  • Dot products: Cosine Trees and QUIC-SVD
  • MLPACK

4
Plotting high-D in 2-D
Dimension reduction beyond PCA: manifolds,
embedding, etc.
5
Goal
  • New displays for high-dimensional data
  • Isometric non-negative matrix factorization
  • Rank-based embedding
  • Density-preserving maps
  • Co-occurrence embedding
  • New algorithms for scaling them to big datasets
  • Distances: Generalized Fast Multipole Method
  • Dot products: Cosine Trees and QUIC-SVD
  • MLPACK

6
Isometric Non-negative Matrix Factorization
  • NMF maintains the interpretability of the
    components of data like images, text, or spectra
    (SDSS)
  • However, as a low-D display it is not in general
    faithful to the original distances
  • Isometric NMF (Vasiloglou, Gray, and Anderson, to
    be submitted to SIAM DM 2008) preserves both
    distances and non-negativity via a geometric
    programming formulation; a sketch of standard NMF
    follows below
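As background for the isometric variant, here is a minimal sketch of standard NMF using the classic Lee & Seung multiplicative updates. This is not the geometric-programming formulation of Isometric NMF described above; all names and parameters here are illustrative.

import numpy as np

def nmf(V, k, n_iters=200, eps=1e-9):
    # Standard NMF via multiplicative updates: V ~ W @ H,
    # with all entries of W and H kept non-negative.
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H; stays non-negative
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W; stays non-negative
    return W, H

# Example: factor a small non-negative matrix into k = 3 interpretable parts.
V = np.random.default_rng(1).random((20, 10))
W, H = nmf(V, k=3)
print(np.linalg.norm(V - W @ H))  # reconstruction error

The non-negativity of W and H is what keeps the components interpretable; the multiplicative form of the updates preserves it automatically, since no subtraction can produce a negative entry.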

7
Rank-based Embedding
  • Suppose you don't really have meaningful or
    reliable distances, but you can say "A and B are
    farther apart than A and C", e.g. in document
    relevance
  • It is still possible to make an embedding! In
    fact there is some indication that using ranks is
    more stable than using distances
  • Can be formulated using hyperkernels; becomes
    either an SDP or a QP (Ouyang and Gray, ICML 2008)

8
Density-preserving Maps
  • Preserving densities is statistically more
    meaningful than preserving distances
  • Might allow more reliable conclusions from the
    low-D display about clustering and outliers
  • DC formulation (Ozakin and Gray, to be submitted
    to AISTATS 2008)

9
Co-occurrence Embedding
  • Consider InBio data: 3M occurrences of species in
    Costa Rica
  • Densities are not reliable, as the sampling
    strategy is unknown
  • But the overlap of two species' densities
    (co-occurrence) may be more reliable
  • How can distribution distances be embedded? (Syed,
    Ozakin, and Gray, to be submitted to ICML 2009)

10
Goal
  • New displays for high-dimensional data
  • Isometric non-negative matrix factorization
  • Rank-based embedding
  • Density-preserving maps
  • Co-occurrence embedding
  • New algorithms for scaling them to big datasets
  • Distances: Generalized Fast Multipole Method
  • Dot products: Cosine Trees and QUIC-SVD
  • MLPACK

11
Computational problem
  • Such manifold methods are expensive: typically
    O(N^3)
  • But it is big datasets that are often the most
    important to visually summarize
  • What are the underlying computations?
  • All-k-nearest-neighbors
  • Kernel summations
  • Eigendecomposition
  • Convex optimization

12
Computational problem
  • Such manifold methods are expensive: typically
    O(N^3)
  • But it is big datasets that are often the most
    important to visually summarize
  • What are the underlying computations? (A
    brute-force baseline for the first of these is
    sketched after this list.)
  • All-k-nearest-neighbors (distances)
  • Kernel summations (distances)
  • Eigendecomposition (dot products)
  • Convex optimization (dot products)
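To make the bottleneck concrete, here is a minimal brute-force baseline for all-k-nearest-neighbors, which costs O(N^2) distance evaluations; the tree-based methods in the following slides are designed to beat this. A numpy sketch; the function name is illustrative.

import numpy as np

def all_knn_brute(X, k):
    # Naive all-k-nearest-neighbors: compute every pairwise distance.
    # Pairwise squared Euclidean distances via the expansion
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # exclude each point itself
    return np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest per row

X = np.random.default_rng(0).random((1000, 3))
nbrs = all_knn_brute(X, k=5)
print(nbrs.shape)  # (1000, 5)

Both the time and the memory of this baseline grow quadratically in N, which is exactly why it cannot be applied to massive datasets directly.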

13
Distances: Generalized Fast Multipole Method
  • Generalized N-body Problems (Gray and Moore, NIPS
    2000; Riegel, Boyer, and Gray, TR 2008) include:
  • All-k-nearest-neighbors
  • Kernel summations
  • Force summations in physics
  • A very large number of bottleneck statistics and
    machine learning computations
  • Defined using category theory

14
Distances: Generalized Fast Multipole Method
  • There exists a generalization (Gray and Moore,
    NIPS 2000; Riegel, Boyer, and Gray, TR 2008) of
    the Fast Multipole Method (Greengard and Rokhlin
    1987) which:
  • specializes to each of these problems
  • is the fastest practical algorithm for these
    problems
  • elucidates general principles for such problems
  • Parallel version: THOR (Tree-based Higher-Order
    Reduce)

15
Distances: Generalized Fast Multipole Method
  • Elements of the GFMM (the traversal pattern is
    sketched below):
  • A spatial tree data structure, e.g. kd-trees,
    metric trees, cover trees, SVD trees, disk trees
  • A tree expansion pattern
  • Tree-stored cached statistics
  • An error criterion and pruning criterion
  • A local approximation/pruning scheme with error
    bounds, e.g. Hermite expansions, Monte Carlo,
    exact pruning
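A schematic sketch of how these elements fit together in a dual-tree traversal (the "tree expansion pattern"). The prune test, approximation, and base case are placeholders for the problem-specific error criterion and approximation scheme; this illustrates the pattern and is not the THOR implementation.

class Node:
    # Minimal tree node: leaves hold points, internal nodes hold children.
    def __init__(self, points=(), children=()):
        self.points = list(points)
        self.children = list(children)
    def is_leaf(self):
        return not self.children

def dual_tree(q_node, r_node, can_prune, approximate, base_case):
    # Prune a node pair when the error criterion allows it, otherwise
    # recurse; do exact work only between pairs of leaves.
    if can_prune(q_node, r_node):
        approximate(q_node, r_node)   # e.g. series expansion, Monte Carlo
        return
    if q_node.is_leaf() and r_node.is_leaf():
        base_case(q_node, r_node)     # exhaustive point-point computation
        return
    q_kids = q_node.children if not q_node.is_leaf() else [q_node]
    r_kids = r_node.children if not r_node.is_leaf() else [r_node]
    for q in q_kids:
        for r in r_kids:
            dual_tree(q, r, can_prune, approximate, base_case)

# Toy usage: enumerate all cross pairs exactly (never prune).
left = Node(points=[0.1, 0.2]); right = Node(points=[0.8, 0.9])
root = Node(children=[left, right])
pairs = []
dual_tree(root, root,
          can_prune=lambda q, r: False,
          approximate=lambda q, r: None,
          base_case=lambda q, r: pairs.extend((a, b) for a in q.points
                                              for b in r.points))
print(len(pairs))  # 16 = 4 x 4 point pairs

The cached statistics live on the nodes, and the speedup comes entirely from how often can_prune fires: whole blocks of point pairs are then handled in one approximate step instead of one at a time.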

16
kd-trees: the most widely-used space-partitioning
tree (Bentley 1975; Friedman, Bentley & Finkel
1977; Moore & Lee 1995)
17-22
A kd-tree, levels 1 through 6 (figure sequence
showing the recursive splits; a minimal
construction sketch follows)
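A minimal kd-tree construction sketch in the spirit of the figure sequence: median split along one coordinate axis, cycling axes with depth. Real implementations also cache per-node statistics (bounding boxes, counts) for pruning; this is an illustration, not MLPACK's code, and leaf_size is an arbitrary choice.

import numpy as np

def build_kdtree(points, depth=0, leaf_size=8):
    # Recursively split on the median along one coordinate axis,
    # cycling through the axes with depth (Bentley 1975 style).
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    axis = depth % points.shape[1]       # cycle through dimensions
    points = points[np.argsort(points[:, axis])]
    mid = len(points) // 2
    return {
        "leaf": False,
        "axis": axis,
        "split": points[mid, axis],      # median split value
        "left": build_kdtree(points[:mid], depth + 1, leaf_size),
        "right": build_kdtree(points[mid:], depth + 1, leaf_size),
    }

X = np.random.default_rng(0).random((200, 2))
tree = build_kdtree(X)  # each level halves the points, as in the figures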
23
Example: Generalized histogram
(figure: counting range of bandwidth h around a
query point q)
24-39
Range-count recursive algorithm (animation):
recursively descend the kd-tree, pruning a node
when its bounding box lies entirely within the
range ("Pruned! (inclusion)") or entirely outside
it ("Pruned! (exclusion)").
The fastest practical algorithm (Bentley 1975); our
algorithms can use any tree. A minimal sketch
follows.
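A minimal single-tree version of the range-count recursion, showing both prunes from the animation: exclusion (the node's box is entirely outside the range, contributing nothing) and inclusion (the box is entirely inside, so the cached count is used without visiting any points). It assumes, as sketched here, kd-tree nodes augmented with bounding boxes and counts; details like the widest-dimension split are illustrative choices.

import numpy as np

def build(points, leaf_size=8):
    # kd-tree node augmented with its bounding box and point count
    # (the "cached statistics" used for pruning).
    node = {"lo": points.min(axis=0), "hi": points.max(axis=0),
            "count": len(points)}
    if len(points) <= leaf_size:
        node["points"] = points
    else:
        axis = int(np.argmax(node["hi"] - node["lo"]))  # widest dimension
        order = np.argsort(points[:, axis])
        mid = len(points) // 2
        node["children"] = [build(points[order[:mid]], leaf_size),
                            build(points[order[mid:]], leaf_size)]
    return node

def range_count(node, q, h):
    # Count points within distance h of q, pruning whole nodes.
    nearest = np.clip(q, node["lo"], node["hi"])      # closest box point
    d_min = np.linalg.norm(q - nearest)
    farthest = np.where(np.abs(q - node["lo"]) > np.abs(q - node["hi"]),
                        node["lo"], node["hi"])       # farthest box corner
    d_max = np.linalg.norm(q - farthest)
    if d_min > h:              # Pruned! (exclusion): nothing can be in range
        return 0
    if d_max <= h:             # Pruned! (inclusion): everything is in range
        return node["count"]
    if "points" in node:       # leaf: check the points exhaustively
        return int((np.linalg.norm(node["points"] - q, axis=1) <= h).sum())
    return sum(range_count(c, q, h) for c in node["children"])

X = np.random.default_rng(0).random((5000, 2))
print(range_count(build(X), X[0], h=0.2))

The same skeleton generalizes: replacing the count with a kernel sum, and the exact prunes with approximations carrying error bounds, gives the GFMM-style algorithms described above.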
40
Dot products: Cosine Trees and QUIC-SVD
  • QUIC-SVD (Holmes, Gray, and Isbell, NIPS 2008)
  • Cosine trees: trees for dot products (splitting
    rule sketched below)
  • Use Monte Carlo within cosine trees to achieve a
    best low-rank approximation with user-specified
    relative error
  • Very fast, but with probabilistic bounds
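A sketch of a single cosine-tree split in the spirit of QUIC-SVD: sample a pivot row with probability proportional to its squared length, then partition the rows by their cosine similarity to the pivot. Splitting at the median cosine is a simplification assumed here, and the full algorithm's per-node Monte Carlo error estimates are omitted.

import numpy as np

def split_cosine_node(rows, rng):
    # Length-squared sampling: long rows are more likely pivots.
    norms2 = (rows ** 2).sum(axis=1)
    pivot = rows[rng.choice(len(rows), p=norms2 / norms2.sum())]
    # Cosine similarity of every row to the pivot.
    cos = (rows @ pivot) / (np.linalg.norm(rows, axis=1)
                            * np.linalg.norm(pivot) + 1e-12)
    median = np.median(cos)
    # Rows pointing most like the pivot vs. the rest.
    return rows[cos >= median], rows[cos < median]

rng = np.random.default_rng(0)
A = rng.random((100, 20))
near, far = split_cosine_node(A, rng)
print(near.shape, far.shape)

Grouping rows by direction makes each node nearly rank-one, which is what lets a few sampled rows per node approximate the whole matrix's row space cheaply.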

41
Dot products: Cosine Trees and QUIC-SVD
  • Uses of QUIC-SVD:
  • PCA, KPCA (eigendecomposition); see the SVD-to-PCA
    example below
  • Working on: fast interior-point convex
    optimization
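To illustrate the first use: PCA falls out of an SVD of the centered data matrix, so any fast approximate SVD (such as QUIC-SVD) can be dropped in where the exact SVD appears below. This sketch uses numpy's exact SVD purely for illustration.

import numpy as np

# The principal directions of centered data are its right singular
# vectors; the singular values give the component variances.
X = np.random.default_rng(0).random((500, 8))
Xc = X - X.mean(axis=0)                 # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                     # top-2 principal directions
projected = Xc @ components.T           # 2-D embedding for plotting
print(projected.shape)                  # (500, 2)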

42
Bigger goal: make all the best statistical/learning
methods efficient!
  • Ground rules:
  • Asymptotic speedup as well as practical speedup
  • Arbitrarily high accuracy, with error guarantees
  • No manual tweaking
  • Really works (validated in a big real-world
    problem)
  • Treating entire classes of methods:
  • Methods based on distances (generalized N-body
    problems)
  • Methods based on dot products (linear algebra)
  • Soon: methods based on discrete structures
    (combinatorial/graph problems)
  • Watch for MLPACK, coming Dec. 2008
  • Meant to be the equivalent of linear algebra's
    LAPACK

43
So far: fastest algorithms for...
(year of our fastest algorithm: method, with the
year the method dates from)
  • 2000: all-nearest-neighbors (1970)
  • 2000: n-point correlation functions (1950)
  • 2003, 2005, 2006: kernel density estimation (1953)
  • 2004: nearest-neighbor classification (1965)
  • 2005, 2006, 2008: nonparametric Bayes classifier
    (1951)
  • 2006: mean-shift clustering/tracking (1972)
  • 2006: k-means clustering (1960s)
  • 2007: hierarchical clustering/EMST (1960s)
  • 2007: affinity propagation/clustering (2007)

44
So far: fastest algorithms for... (continued)
  • 2008: principal component analysis (1930s)
  • 2008: local linear kernel regression (1960s)
  • 2008: hidden Markov models (1970s)
  • Working on:
  • linear regression, Kalman filters (1960s)
  • Gaussian process regression (1960s)
  • Gaussian graphical models (1970s)
  • Manifolds, spectral clustering (2000s)
  • Convex kernel machines (2000s)

45
Some application highlights so far
  • First large-scale dark energy confirmation (top
    Science breakthrough of 2003)
  • First large-scale cosmic magnification
    confirmation (Nature, 2005)
  • Integration into Google image search (we think),
    2005
  • Integration into Microsoft SQL Server, 2008
  • Working on:
  • Integration into Large Hadron Collider pipeline,
    2008
  • Fast IP-based spam filtering (Secure Comp, 2008)
  • Fast recommendation (Netflix)

46
To find out more:
  • Best way: agray@cc.gatech.edu
  • Mostly outdated: www.cc.gatech.edu/agray