Title: Parallel Coordinate Descent for L1-Regularized Loss Minimization
Slide 1: Parallel Coordinate Descent for L1-Regularized Loss Minimization
- Joseph K. Bradley, Aapo Kyrola, Danny Bickson, Carlos Guestrin
Slide 2: L1-Regularized Regression
Stock volatility (label); bigrams from financial reports (features).
5×10^6 features, 3×10^4 samples (Kogan et al., 2009)
Slide 3: From Sequential Optimization...
- Many algorithms: gradient descent, stochastic gradient, interior point methods, hard/soft thresholding, ...
- Coordinate descent (a.k.a. Shooting (Fu, 1998))
But what about big problems? 5×10^6 features, 3×10^4 samples.
Slide 4: ...to Parallel Optimization
Had coordinate descent been parallelized? Surprisingly, no!
Slide 5: Our Work
- Shotgun: parallel coordinate descent for L1-regularized regression
- Parallel convergence analysis: linear speedups up to a problem-dependent limit
- Large-scale empirical study: 37 datasets, 9 algorithms
Slide 6: Lasso (Tibshirani, 1996)
Objective: $\min_{x} F(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$ (squared error + L1 regularization)
Slide 7: Shooting: Sequential SCD
Stochastic Coordinate Descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009):
- While not converged:
  - Choose a random coordinate j
  - Update x_j (closed-form minimization; see the sketch below)
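A minimal Python sketch of this loop (not from the slides; the names A, y, lam are illustrative), assuming the objective (1/2)||Ax - y||_2^2 + lambda*||x||_1 with the columns of A normalized so that a_j^T a_j = 1:

```python
import numpy as np

def shooting_scd(A, y, lam, n_iters=10000, seed=0):
    """Sequential stochastic coordinate descent (Shooting) for
    min_x (1/2)||Ax - y||_2^2 + lam * ||x||_1, assuming each column
    a_j of A is normalized so that a_j . a_j = 1."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    resid = A @ x - y                  # maintained residual Ax - y
    for _ in range(n_iters):
        j = rng.integers(d)            # choose a random coordinate
        z = x[j] - A[:, j] @ resid     # gradient step on coordinate j
        x_new = np.sign(z) * max(abs(z) - lam, 0.0)  # soft-threshold
        resid += (x_new - x[j]) * A[:, j]            # keep residual current
        x[j] = x_new
    return x
```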
Slide 8: Shotgun: Parallel SCD
Shotgun (Parallel SCD):
- While not converged:
  - On each of P processors:
    - Choose a random coordinate j
    - Update x_j (same as for Shooting; see the sketch below)
Nice case: uncorrelated features. Bad case: correlated features.
Is SCD inherently sequential?
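A hedged sketch of the Shotgun round: P coordinates drawn at random and updated from the same snapshot of x, mimicking P processors racing on shared memory. The real implementation updates asynchronously; this synchronous simulation is an assumption for illustration, and it matches the setting the analysis considers.

```python
import numpy as np

def shotgun(A, y, lam, P=8, n_rounds=2000, seed=0):
    """Parallel SCD (Shotgun), simulated synchronously: each round,
    P random coordinates are updated from the same snapshot of x.
    Assumes normalized columns (a_j . a_j = 1) and P <= d."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    resid = A @ x - y
    for _ in range(n_rounds):
        J = rng.choice(d, size=P, replace=False)   # P "parallel" coordinates
        z = x[J] - A[:, J].T @ resid               # gradients from one snapshot
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
        resid += A[:, J] @ (x_new - x[J])          # apply all P updates at once
        x[J] = x_new
    return x
```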
Slide 9: Is SCD inherently sequential?
Coordinate update (closed-form minimization):
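Written out in LaTeX, under the same assumptions as above ((1/2)-scaled squared error, a_j^T a_j = 1); this is the standard soft-thresholding form, a sketch rather than the slide's exact notation:

```latex
% Closed-form coordinate minimization for
% min_x (1/2)||Ax - y||_2^2 + \lambda ||x||_1, assuming a_j^T a_j = 1:
x_j \;\leftarrow\; S\!\left(x_j - a_j^{\top}(Ax - y),\; \lambda\right),
\qquad
S(z, \lambda) = \operatorname{sign}(z)\,\max\!\left(|z| - \lambda,\; 0\right).
```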
Slides 10-11: Is SCD inherently sequential?
Theorem: If A is normalized s.t. diag(A^T A) = 1, ...
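The normalization the theorem assumes can be enforced by rescaling columns; a minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def normalize_columns(A):
    """Rescale A so that diag(A^T A) = 1, as the theorem assumes.
    Returns the rescaled matrix and the column norms; a solution
    x_scaled of the rescaled problem maps back via x = x_scaled / norms
    (note the L1 penalty is correspondingly reweighted per coordinate)."""
    norms = np.linalg.norm(A, axis=0)
    norms = np.where(norms == 0.0, 1.0, norms)  # leave all-zero columns alone
    return A / norms, norms
```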
Slide 12: Convergence Analysis
Main Theorem: Shotgun Convergence.
Generalizes the bounds for Shooting (Shalev-Shwartz & Tewari, 2009).
Slide 14: Convergence Analysis
Theorem (Shotgun convergence): assume P is at most the problem-dependent limit P*.
Experiments match the theory!
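A sketch of the limit's shape as recalled from the published paper; the exact constant is an assumption to check against the paper. With d features and rho the spectral radius of A^T A:

```latex
% Problem-dependent limit on parallelism (constant is an assumption):
P^{*} \;=\; \frac{d}{\rho} + 1,
\qquad
\rho = \text{spectral radius of } A^{\top} A .
```

Intuitively, uncorrelated (normalized) features give rho near 1, so P* is near d and almost every update can run in parallel; highly correlated features push rho toward d, so P* is small and little parallelism helps.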
Slide 15: Thus far...
- Shotgun: the naive parallelization of coordinate descent works!
- Theorem: linear speedups up to a problem-dependent limit.
Now for some experiments...
Slide 16: Experiments: Lasso
Slide 17: Experiments: Lasso
Sparse compressed imaging: Sparco (van den Berg et al., 2009).
[Scatter plot: Shotgun runtime (sec) vs. other algorithms' runtime (sec); points above the diagonal mean Shotgun is faster, below mean Shotgun is slower. Example point: Shotgun 1.212 s vs. Shooting 3.406 s on one (dataset, λ) pair.]
Shotgun and parallel L1_LS used 8 cores.
Slide 18: Experiments: Lasso
Single-Pixel Camera (Duarte et al., 2008).
[Scatter plot: Shotgun runtime (sec) vs. other algorithms' runtime (sec); above the diagonal = Shotgun faster.]
Shotgun and parallel L1_LS used 8 cores.
Slide 19: Experiments: Lasso
Single-Pixel Camera (Duarte et al., 2008).
Shooting is one of the fastest algorithms; Shotgun provides additional speedups.
[Scatter plot: Shotgun runtime (sec) vs. other algorithms' runtime (sec); above the diagonal = Shotgun faster.]
Shotgun and parallel L1_LS used 8 cores.
Slide 20: Experiments: Logistic Regression
Slide 21: Experiments: Logistic Regression
Zeta dataset: low-dimensional setting (from the Pascal Large Scale Learning Challenge, http://www.mlbench.org/instructions/).
[Plot: objective value vs. time (sec); lower is better.]
Shotgun and parallel SGD used 8 cores.
Slide 22: Experiments: Logistic Regression
rcv1 dataset (Lewis et al., 2004): high-dimensional setting.
[Plot: objective value vs. time (sec); lower is better.]
Shotgun and parallel SGD used 8 cores.
Slide 23: Shotgun Self-speedup
Aggregated results from all tests.
[Plots: speedup vs. number of cores, against the optimal (linear) line. Lasso time speedup: not so great. But the Lasso iteration speedup shows we are doing fewer iterations! Logistic regression time speedup also shown.]
Slide 24: Conclusions
Future Work:
- Hybrid Shotgun + parallel SGD
- More FLOPS/datum, e.g., Group Lasso (Yuan and Lin, 2006)
- Alternate hardware, e.g., graphics processors
Thanks!
Slide 25: References
- Blumensath, T. and Davies, M.E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265-274, 2009.
- Duarte, M.F., Davenport, M.A., Takhar, D., Laska, J.N., Sun, T., Kelly, K.F., and Baraniuk, R.G. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83-91, 2008.
- Figueiredo, M.A.T., Nowak, R.D., and Wright, S.J. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. of Sel. Top. in Signal Processing, 1(4):586-597, 2008.
- Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.
- Fu, W.J. Penalized regressions: The bridge versus the lasso. J. of Comp. and Graphical Statistics, 7(3):397-416, 1998.
- Kim, S.J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Sel. Top. in Signal Processing, 1(4):606-617, 2007.
- Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Tech.-NAACL, 2009.
- Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. In NIPS, 2009a.
- Lewis, D.D., Yang, Y., Rose, T.G., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361-397, 2004.
- Ng, A.Y. Feature selection, l1 vs. l2 regularization and rotational invariance. In ICML, 2004.
- Shalev-Shwartz, S. and Tewari, A. Stochastic methods for l1 regularized loss minimization. In ICML, 2009.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Statistical Society, 58(1):267-288, 1996.
- van den Berg, E., Friedlander, M.P., Hennenfent, G., Herrmann, F., Saab, R., and Yilmaz, O. Sparco: A testing framework for sparse reconstruction. ACM Transactions on Mathematical Software, 35(4):1-16, 2009.
- Wen, Z., Yin, W., Goldfarb, D., and Zhang, Y. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation. SIAM Journal on Scientific Computing, 32(4):1832-1857, 2010.
- Wright, S.J., Nowak, R.D., and Figueiredo, M.A.T. Sparse reconstruction by separable approximation. IEEE Trans. on Signal Processing, 57(7):2479-2493, 2009.
- Wulf, W.A. and McKee, S.A. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20-24, 1995.
- Yuan, G.X., Chang, K.W., Hsieh, C.J., and Lin, C.J. A comparison of optimization methods and software for large-scale l1-reg. linear classification. JMLR, 11:3183-3234, 2010.
- Zinkevich, M., Weimer, M., Smola, A.J., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.
Slide 27: Experiments: Logistic Regression
Zeta dataset: low-dimensional setting (Pascal Large Scale Learning Challenge, http://www.mlbench.org/instructions/).
[Plots: objective value and test error vs. time (sec); lower is better.]
Slide 28: Experiments: Logistic Regression
rcv1 dataset (Lewis et al., 2004): high-dimensional setting.
[Plots: objective value and test error vs. time (sec); lower is better.]
Slide 29: Shotgun: Improving Self-speedup
[Plots vs. number of cores, each with max/mean/min curves: Lasso time speedup, logistic regression time speedup, and Lasso iteration speedup.]
Slide 30: Shotgun Self-speedup
[Plots vs. number of cores, with max/mean/min curves: Lasso time speedup (not so great), but the iteration speedup shows we are doing fewer iterations!]