Parallel Coordinate Descent for L1-Regularized Loss Minimization

1
Parallel Coordinate Descent for L1-Regularized
Loss Minimization
  • Joseph K. Bradley, Aapo Kyrola
  • Danny Bickson, Carlos Guestrin

2
L1-Regularized Regression
Stock volatility (label)
Bigrams from financial reports (features)
5×10⁶ features, 3×10⁴ samples
(Kogan et al., 2009)
3
From Sequential Optimization...
  • Many algorithms:
  • Gradient descent, stochastic gradient, interior
    point methods, hard/soft thresholding, ...
  • Coordinate descent (a.k.a. Shooting (Fu, 1998))

But for big problems? 5×10⁶ features, 3×10⁴ samples
4
...to Parallel Optimization
Surprisingly, no!
5
Our Work
  • Shotgun: parallel coordinate descent for L1-regularized regression
  • Parallel convergence analysis: linear speedups up to a problem-dependent limit
  • Large-scale empirical study: 37 datasets, 9 algorithms

6
Lasso
(Tibshirani, 1996)
Objective: min_x (1/2)‖Ax − y‖² + λ‖x‖₁
(squared error + L1 regularization)
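For concreteness, a minimal NumPy sketch of this objective; the names A (design matrix), y (labels), x (weights), and lam (regularization weight) are illustrative:

```python
import numpy as np

def lasso_objective(A, y, x, lam):
    """Lasso objective: squared error plus L1 regularization."""
    residual = A @ x - y
    return 0.5 * residual @ residual + lam * np.abs(x).sum()
```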
7
Shooting: Sequential SCD
Stochastic Coordinate Descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009):
  • While not converged:
  • Choose a random coordinate j
  • Update xj (closed-form minimization; see the sketch below)
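A minimal sketch of this loop for the Lasso, assuming the columns of A are normalized so that diag(AᵀA) = 1 (as on the theorem slides); the closed-form update is then the standard soft-thresholding step. This is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def shooting_scd(A, y, lam, iters=10_000, seed=0):
    """Sequential stochastic coordinate descent (Shooting) for the Lasso."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    x = np.zeros(d)
    residual = A @ x - y                      # A x - y, maintained incrementally
    for _ in range(iters):
        j = rng.integers(d)                   # choose a random coordinate j
        grad_j = A[:, j] @ residual           # partial derivative of the squared error
        new_xj = soft_threshold(x[j] - grad_j, lam)   # closed-form minimization over x_j
        residual += A[:, j] * (new_xj - x[j])         # keep residual consistent with x
        x[j] = new_xj
    return x
```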

8
Shotgun: Parallel SCD
Shotgun (parallel SCD):
  • While not converged:
  • On each of P processors, in parallel:
  • Choose a random coordinate j
  • Update xj (same update as in Shooting; a simulation sketch follows below)

Nice case: uncorrelated features
Bad case: correlated features
Is SCD inherently sequential?
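To make the parallel step concrete, a minimal sketch of one Shotgun round: P coordinates are chosen and their updates are all computed from the same stale residual, then applied together. This simulates the round structure rather than reproducing the authors' multicore code; with uncorrelated features the stale residual costs nothing, while correlated features make the concurrent updates interfere:

```python
import numpy as np

def shotgun_round(A, y, x, residual, lam, P, rng):
    """One simulated Shotgun round: P coordinate updates from the same stale residual."""
    d = A.shape[1]
    coords = rng.choice(d, size=P, replace=False)     # coordinates updated "in parallel"
    deltas = {}
    for j in coords:                                  # each update sees the round-start residual
        grad_j = A[:, j] @ residual
        z = x[j] - grad_j
        deltas[j] = np.sign(z) * max(abs(z) - lam, 0.0) - x[j]   # soft-threshold step
    for j, dx in deltas.items():                      # apply all updates together
        x[j] += dx
        residual += A[:, j] * dx
    return x, residual
```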
9
Is SCD inherently sequential?
Coordinate update (closed-form minimization):
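For the Lasso with diag(AᵀA) = 1, the standard closed-form coordinate minimizer (the step used in the Shooting/SCD sketch above) is:

```latex
% Closed-form coordinate update for the Lasso, assuming diag(A^T A) = 1
x_j \;\leftarrow\; S_{\lambda}\!\Big( x_j - \big(A^{\top}(Ax - y)\big)_j \Big),
\qquad
S_{\lambda}(z) = \operatorname{sign}(z)\,\max\big(|z| - \lambda,\, 0\big).
```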
10
Is SCD inherently sequential?
Theorem
If A is normalized s.t. diag(AᵀA) = 1,
11
Is SCD inherently sequential?
Theorem
If A is normalized s.t. diag(AᵀA) = 1,
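One way to see why correlated features are the bad case: when two coordinates j and k are updated concurrently from the same iterate x, the separable L1 term adds up exactly, but the squared-error term picks up a cross term through the Gram matrix AᵀA:

```latex
% Interaction between concurrent updates \delta x_j and \delta x_k
F(x + \delta x_j e_j + \delta x_k e_k) - F(x)
  \;=\; \Delta_j + \Delta_k + (A^{\top} A)_{jk}\, \delta x_j \, \delta x_k ,
```

where Δ_j and Δ_k are the changes each update would produce on its own. With orthogonal (uncorrelated) columns the cross term vanishes, so P parallel updates are as good as P sequential ones; with diag(AᵀA) = 1, large off-diagonal entries mean concurrent updates can partly cancel each other's progress.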
12
Convergence Analysis
Main Theorem: Shotgun Convergence
Generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009)
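Paraphrasing the shape of this result rather than its exact constants: with d features and ρ the spectral radius of AᵀA, the analysis gives nearly linear speedups in the number of parallel updates P up to a problem-dependent limit of roughly

```latex
% Problem-dependent limit on useful parallelism (approximate form)
P^{*} \;\approx\; \frac{d}{\rho} + 1,
\qquad
\rho = \text{spectral radius of } A^{\top} A .
```

For nearly uncorrelated features ρ is close to 1 and P* is close to d, so many coordinates can be updated at once; for highly correlated features ρ grows toward d and P* shrinks toward sequential updating.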
13
Convergence Analysis
14
Convergence Analysis
Theorem: Shotgun Convergence
Experiments match the theory!
15
Thus far...
  • Shotgun: the naive parallelization of coordinate descent works!
  • Theorem: linear speedups up to a problem-dependent limit.

Now for some experiments
16
Experiments: Lasso
17
Experiments: Lasso
Datasets: Sparse Compressed Imaging; Sparco (van den Berg et al., 2009)
[Scatter plots of each other algorithm's runtime (sec) vs. Shotgun's runtime (sec); points above the diagonal mean Shotgun is faster, points below mean Shotgun is slower.]
On one (dataset, λ) pair: Shotgun 1.212 s, Shooting 3.406 s.
Shotgun and Parallel L1_LS used 8 cores.
18
Experiments: Lasso
Single-Pixel Camera datasets (Duarte et al., 2008)
[Scatter plot of other algorithms' runtime (sec) vs. Shotgun's runtime (sec); above the diagonal Shotgun is faster, below it is slower.]
Shotgun and Parallel L1_LS used 8 cores.
19
Experiments: Lasso
Single-Pixel Camera datasets (Duarte et al., 2008)
Shooting is one of the fastest algorithms; Shotgun provides additional speedups.
[Same runtime scatter plot, comparing Shotgun against the other algorithms.]
Shotgun and Parallel L1_LS used 8 cores.
20
Experiments: Logistic Regression
21
Experiments: Logistic Regression
Zeta dataset (low-dimensional setting), from the Pascal Large Scale Learning Challenge: http://www.mlbench.org/instructions/
[Plot of objective value vs. time (sec); lower is better.]
Shotgun and Parallel SGD used 8 cores.
22
Experiments: Logistic Regression
rcv1 dataset (Lewis et al., 2004), high-dimensional setting
[Plot of objective value vs. time (sec); lower is better.]
Shotgun and Parallel SGD used 8 cores.
23
Shotgun: Self-speedup
Aggregated results from all tests.
[Plots of speedup vs. number of cores, against the optimal (linear) reference: Lasso iteration speedup, Lasso time speedup, and logistic regression time speedup.]
Time speedup: not so great. But we are doing fewer iterations!
24
Conclusions
  • Future Work
  • Hybrid Shotgun / parallel SGD
  • More FLOPS/datum, e.g., Group Lasso (Yuan and
    Lin, 2006)
  • Alternate hardware, e.g., graphics processors

Thanks!
25
References
  • Blumensath, T. and Davies, M.E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
  • Duarte, M.F., Davenport, M.A., Takhar, D., Laska, J.N., Sun, T., Kelly, K.F., and Baraniuk, R.G. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008.
  • Figueiredo, M.A.T., Nowak, R.D., and Wright, S.J. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. of Sel. Top. in Signal Processing, 1(4):586–597, 2008.
  • Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
  • Fu, W.J. Penalized regressions: The bridge versus the lasso. J. of Comp. and Graphical Statistics, 7(3):397–416, 1998.
  • Kim, S.J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Sel. Top. in Signal Processing, 1(4):606–617, 2007.
  • Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Tech.-NAACL, 2009.
  • Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. In NIPS, 2009a.
  • Lewis, D.D., Yang, Y., Rose, T.G., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.
  • Ng, A.Y. Feature selection, l1 vs. l2 regularization, and rotational invariance. In ICML, 2004.
  • Shalev-Shwartz, S. and Tewari, A. Stochastic methods for l1 regularized loss minimization. In ICML, 2009.
  • Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Statistical Society, 58(1):267–288, 1996.
  • van den Berg, E., Friedlander, M.P., Hennenfent, G., Herrmann, F., Saab, R., and Yilmaz, O. Sparco: A testing framework for sparse reconstruction. ACM Transactions on Mathematical Software, 35(4):1–16, 2009.
  • Wen, Z., Yin, W., Goldfarb, D., and Zhang, Y. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation. SIAM Journal on Scientific Computing, 32(4):1832–1857, 2010.
  • Wright, S.J., Nowak, R.D., and Figueiredo, M.A.T. Sparse reconstruction by separable approximation. IEEE Trans. on Signal Processing, 57(7):2479–2493, 2009.
  • Wulf, W.A. and McKee, S.A. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.
  • Yuan, G.X., Chang, K.W., Hsieh, C.J., and Lin, C.J. A comparison of optimization methods and software for large-scale l1-regularized linear classification. JMLR, 11:3183–3234, 2010.
  • Zinkevich, M., Weimer, M., Smola, A.J., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.

26
TO DO
  • References slide
  • Backup slides
  • Discussion with reviewer about SGD vs SCD in
    terms of d,n

27
Experiments: Logistic Regression
Zeta dataset (low-dimensional setting), from the Pascal Large Scale Learning Challenge: http://www.mlbench.org/instructions/
[Plots of objective value and test error vs. time (sec); lower is better.]
28
Experiments: Logistic Regression
rcv1 dataset (Lewis et al., 2004), high-dimensional setting
[Plots of objective value and test error vs. time (sec); lower is better.]
29
Shotgun: Improving Self-speedup
[Plots of speedup vs. number of cores, showing max, mean, and min across runs: Lasso time speedup, logistic regression time speedup, and Lasso iteration speedup.]
30
Shotgun: Self-speedup
[Plots of Lasso speedup vs. number of cores (time and iterations), showing max, mean, and min across runs.]
Time speedup: not so great. But we are doing fewer iterations!