Title: Numerical Linear Algebra in the Streaming Model
1. Numerical Linear Algebra in the Streaming Model
- Ken Clarkson - IBM
- David Woodruff - IBM
2. The Problems
- Given an n x d matrix A and an n x d matrix B, we want estimators for:
  - The matrix product A^T B
  - The matrix X minimizing ‖AX − B‖, a slightly generalized version of linear regression
  - Given an integer k, the matrix A_k of rank k minimizing ‖A − A_k‖
- We consider the Frobenius matrix norm: ‖C‖ = (Σ_{i,j} C_{i,j}²)^{1/2}, the square root of the sum of squares of the entries
3. General Properties of Our Algorithms
- 1 pass over the matrix entries, given in any order (multiple updates per entry allowed)
- Maintain compressed versions, or sketches, of the matrices
- Do a small amount of work per entry to maintain the sketches
- Output the result using the sketches
- Randomized approximation algorithms
- Since we minimize space complexity, we restrict matrix entries to O(log nc) bits, or O(log nd) bits
4. Matrix Compression Methods
- In a line of similar efforts:
  - Element-wise sampling [AM01, AHK06]
  - Row / column sampling: pick a small random subset of the rows, columns, or both [DK01, DKM04, DMM08]
  - Sketching / random projection: maintain a small number of random linear combinations of rows or columns [S06]
- Usually more than 1 pass
- Here: sketching
5. Outline
- Matrix Product
- Linear Regression
- Low-rank approximation
6. An Optimal Matrix Product Algorithm
- A and B have n rows and a total of c columns, and we want to estimate A^T B so that ‖Est − A^T B‖ ≤ ε‖A‖‖B‖
- Let S be an n x m sign (Rademacher) matrix
  - Each entry is +1 or −1 with probability ½
  - m is small, set to O(ε^{-2} log(1/δ))
  - Entries are O(log(1/δ))-wise independent
- Observation:
  - E[A^T S S^T B/m] = A^T E[S S^T] B/m = A^T B
- Wouldn't it be nice if all the algorithm did was maintain S^T A and S^T B, and output A^T S S^T B/m? (See the sketch below.)
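As a concrete illustration, here is a minimal numpy sketch of this estimator, with made-up dimensions and fully independent signs (the analysis only needs limited independence); it is not the authors' implementation.

```python
# Sign-sketch estimator for A^T B: keep S^T A and S^T B, output (S^T A)^T (S^T B) / m.
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, m = 1000, 20, 30, 200     # m would be O(eps^-2 log(1/delta)) in the analysis

A = rng.standard_normal((n, d1))
B = rng.standard_normal((n, d2))

S = rng.choice([-1.0, 1.0], size=(n, m))   # fully independent signs, for simplicity
StA = S.T @ A                              # m x d1 sketch
StB = S.T @ B                              # m x d2 sketch

est = StA.T @ StB / m                      # estimate of A^T B
err = np.linalg.norm(est - A.T @ B)        # Frobenius-norm error
bound = np.linalg.norm(A) * np.linalg.norm(B)
print(err, err / bound)                    # error relative to ||A|| ||B||
```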
7. An Optimal Matrix Product Algorithm
- This does work, and we are able to improve the previous dependence on m
- New tail estimate: for δ, ε > 0, there is m = O(ε^{-2} log(1/δ)) so that
  - Pr[‖A^T S S^T B/m − A^T B‖ > ε‖A‖‖B‖] ≤ δ
  - (again, ‖C‖ = (Σ_{i,j} C_{i,j}²)^{1/2})
- Follows from bounding the O(log(1/δ))-th moment of ‖A^T S S^T B/m − A^T B‖
8. Efficiency
- Easy to maintain the sketches given updates (see the update sketch below)
  - O(m) time per update, O(mc log(nc)) bits of space for S^T A and S^T B
- Improves Sarlos' algorithm by a log c factor
  - Sarlos' algorithm is based on the JL lemma
  - JL preserves all entries of A^T B up to an additive error, whereas we only preserve the overall error
- Can compute A^T S S^T B/m via fast rectangular matrix multiplication
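The per-entry update is where the O(m) cost comes from: an update to A[i, j] touches only row i of S and column j of the sketch. The snippet below uses an assumed update interface (i, j, delta) and stores the signs explicitly for simplicity; a real implementation would derive them from a limited-independence hash.

```python
# Streaming maintenance of S^T A: each update A[i, j] += delta costs O(m) time,
# since (S^T A)[:, j] += delta * S[i, :].
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 500, 10, 100
S = rng.choice([-1.0, 1.0], size=(n, m))   # stand-in for hash-generated signs
StA = np.zeros((m, d))                      # the only state kept for A

def update(i, j, delta):
    """Process one stream update A[i, j] += delta in O(m) time."""
    StA[:, j] += delta * S[i, :]

# Feed a random update stream and compare against the directly computed sketch.
A = np.zeros((n, d))
for _ in range(10000):
    i, j = rng.integers(n), rng.integers(d)
    delta = rng.standard_normal()
    A[i, j] += delta
    update(i, j, delta)

print(np.allclose(StA, S.T @ A))            # True: the maintained sketch equals S^T A
```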
9. Matrix Product Lower Bound
- Our algorithm is space-optimal for constant δ
  - a new lower bound
- Reduction from a communication game
  - Augmented Indexing, with players Alice and Bob
  - Alice has a random x ∈ {0,1}^s
  - Bob has a random i ∈ {1, 2, …, s}, and also x_{i+1}, …, x_s
  - Alice sends Bob one message
  - Bob should output x_i with probability at least 2/3
- Theorem [MNSW]: the message must be Ω(s) bits on average
10. Lower Bound Proof
- Set s = Θ(c ε^{-2} log(cn))
- Alice makes a matrix U
  - Uses x_1, …, x_s
- Bob makes matrices U' and B
  - Uses i and x_{i+1}, …, x_s
- The algorithm's input will be A = U + U' and B
  - A and B are n x c/2
- Alice:
  - Runs the streaming matrix product algorithm Alg on U
  - Sends Alg's state to Bob
- Bob continues Alg with A = U + U' and B
- A^T B determines x_i with probability at least 2/3
  - By the choice of U, U', B
  - Solving Augmented Indexing
- So the space of Alg must be Ω(s) = Ω(c ε^{-2} log(cn)) bits
11. Lower Bound Details
- U consists of the submatrices U(1), U(2), …, U(log(cn)), with the remaining entries 0s
- Each U(k) is an ε^{-2} x c/2 submatrix with entries in {−10^k, 10^k}
  - U(k)_{i,j} = 10^k if the matched entry of x is 0, else U(k)_{i,j} = −10^k
  - Bob's index i corresponds to some entry U(k)_{i,j}
- U' is such that A = U + U' consists of U(1), U(2), …, U(k), with the remaining entries 0s
  - U' is determined from x_{i+1}, …, x_s
- A^T B is the i-th row of U(k)
- A ≈ U(k), since the entries of A are geometrically increasing
- ε²‖A‖²‖B‖², the squared error, is small, so most entries of the approximation to A^T B have the correct sign
12. Outline
- Matrix Product
- Linear Regression
- Low-rank approximation
13. Linear Regression
- The problem: min_X ‖AX − B‖
- The X minimizing this is X* = A^− B, where A^− is the pseudo-inverse of A
- Every matrix A = UΣV^T, using the singular value decomposition (SVD)
  - If A is n x d of rank k, then
    - U is n x k with orthonormal columns
    - Σ is a k x k diagonal matrix with positive diagonal
    - V is d x k with orthonormal columns
  - A^− = VΣ^{-1}U^T
- Normal equations: A^T A X* = A^T B for the optimal X*
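A quick numerical check of these facts, using numpy's SVD-based pseudo-inverse (the dimensions are arbitrary):

```python
# X* = A^- B solves min_X ||AX - B|| and satisfies the normal equations A^T A X* = A^T B.
import numpy as np

rng = np.random.default_rng(2)
n, d, dprime = 200, 8, 3
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, dprime))

X_star = np.linalg.pinv(A) @ B                           # A^- B
print(np.allclose(A.T @ A @ X_star, A.T @ B))            # normal equations hold

# Any other X can only have a larger residual than the minimizer X*.
X_other = X_star + 0.1 * rng.standard_normal(X_star.shape)
print(np.linalg.norm(A @ X_star - B) <= np.linalg.norm(A @ X_other - B))
```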
14. Linear Regression
- Let S be an n x m sign matrix, with m = O(d ε^{-1} log(1/δ))
- The algorithm (sketched in code below):
  - Maintain S^T A and S^T B
  - Return the X' solving min_X ‖S^T(AX − B)‖
- Space is O(d² ε^{-1} log(1/δ)) words
  - Improves Sarlos' space by a log c factor
  - Space is optimal via a new lower bound
- Main claim: with probability at least 1 − δ,
  - ‖AX' − B‖ ≤ (1+ε)‖AX* − B‖
  - That is, the relative error for X' is small
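A minimal sketch of this algorithm, assuming fully independent signs and made-up sizes; the sketched problem is solved with an off-the-shelf least-squares routine:

```python
# Sketched regression: keep S^T A and S^T B, then solve the small m x d problem
# min_X ||S^T A X - S^T B||.
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 5000, 10, 400                      # m would be O(d eps^-1 log(1/delta))
A = rng.standard_normal((n, d))
X_true = rng.standard_normal((d, 1))
B = A @ X_true + 0.1 * rng.standard_normal((n, 1))

S = rng.choice([-1.0, 1.0], size=(n, m))
StA, StB = S.T @ A, S.T @ B                  # the only state the algorithm keeps

X_prime = np.linalg.lstsq(StA, StB, rcond=None)[0]   # argmin_X ||S^T(AX - B)||
X_star = np.linalg.lstsq(A, B, rcond=None)[0]        # offline optimum, for comparison

print(np.linalg.norm(A @ X_prime - B) / np.linalg.norm(A @ X_star - B))  # close to 1
```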
15. Regression Analysis
- Why should the X' solving min_X ‖S^T(AX − B)‖ be good?
- S^T approximately preserves ‖AX − B‖ for fixed X
  - If this worked for all X, we'd be done
  - S^T must preserve norms even for X', which is chosen using S
- First reduce to showing that ‖A(X' − X*)‖ is small
  - Use the normal equation A^T A X* = A^T B
  - Implies ‖AX' − B‖² = ‖AX* − B‖² + ‖A(X' − X*)‖²
- Bounding ‖A(X' − X*)‖² is equivalent to bounding ‖U^T A(X' − X*)‖², where A = UΣV^T from the SVD and U is an orthonormal basis of the column space of A
16. Regression Analysis Continued
- Bounding β² := ‖U^T A(X' − X*)‖²
  - ‖U^T A(X' − X*)‖ ≤ ‖(U^T S S^T U/m)^−‖₂ · ‖(U^T S S^T U/m) U^T A(X' − X*)‖
- The normal equations in sketch space imply
  - (S^T A)^T(S^T A)X' = (S^T A)^T(S^T B)
  - (U^T S S^T U) U^T A(X' − X*) = U^T S S^T A(X' − X*)
  - U^T S S^T A(X' − X*) = U^T S S^T(B − AX*)
- Bounding ‖U^T S S^T(B − AX*)/m‖:
  - ‖(U^T S S^T U/m) U^T A(X' − X*)‖ = ‖U^T S S^T(B − AX*)/m‖
  - ≤ (ε/k)^{1/2}‖U‖‖B − AX*‖ (new tail estimate, using U^T(B − AX*) = 0)
  - = ε^{1/2}‖B − AX*‖
17. Regression Analysis Continued
- Hence, β = ‖U^T A(X' − X*)‖
  - ≤ ‖U^T S S^T(B − AX*)/m‖ · ‖(U^T S S^T U/m)^−‖₂
  - ≤ ε^{1/2}‖B − AX*‖ · ‖(U^T S S^T U/m)^−‖₂
- Recall the spectral norm ‖A‖₂ = sup_x ‖Ax‖/‖x‖
  - Implies ‖CD‖ ≤ ‖C‖₂‖D‖
- ‖(U^T S S^T U/m)^−‖₂ ≤ 2 whenever ‖U^T S S^T U/m − I‖₂ ≤ 1/2
- Subspace JL: for m = Ω(k log(1/δ)), S^T approximately preserves the lengths of all vectors in a k-dimensional space
  - So ‖U^T S S^T U/m − I‖₂ ≤ 1/2 (checked empirically below)
- So β ≤ 2ε^{1/2}‖AX* − B‖
- ‖AX' − B‖² = ‖AX* − B‖² + β² ≤ ‖AX* − B‖² + 4ε‖AX* − B‖²
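The subspace condition is easy to check empirically; the following snippet, with an arbitrary k-dimensional subspace and fully independent signs, typically prints a value well below 1/2:

```python
# Empirical check that the scaled sketch nearly preserves a k-dimensional subspace:
# ||U^T S S^T U / m - I||_2 is small for a random sign matrix S.
import numpy as np

rng = np.random.default_rng(4)
n, k, m = 2000, 5, 600
U, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis of a k-space

S = rng.choice([-1.0, 1.0], size=(n, m))
M = U.T @ S @ S.T @ U / m
print(np.linalg.norm(M - np.eye(k), ord=2))        # spectral norm, typically << 1/2
```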
18. Regression Lower Bound
- A tight Ω(d² ε^{-1} log(nd)) space lower bound
- Again a reduction from Augmented Indexing
  - This time more complicated
- Embed ε^{-1} log(nd) independent regression sub-problems into the hard instance
  - Uses deletions and the geometrically growing property, as in the matrix product lower bound
- Choose the entries of A and b so that the algorithm's output x encodes some entries of A
19. Regression Lower Bound
- A lower bound of Ω(d²) is already tricky because of bit complexity
- Natural approach:
  - Alice has a random d x d sign matrix A^{-1}
  - b is a standard basis vector e_i
  - Alice computes A = (A^{-1})^{-1} and puts it into the stream. The solution x to min_x ‖Ax − b‖ is the i-th column of A^{-1}
  - Bob can isolate entries of A^{-1}, solving Indexing
- Wrong: A has entries that can be exponentially small!
20. Regression Lower Bound
- We design A and b together (Aug. Index):

  A = [ 1  A_{1,2}  A_{1,3}  A_{1,4}  A_{1,5} ]        x = (x_1, x_2, x_3, x_4, x_5)^T
      [ 0     1     A_{2,3}  A_{2,4}  A_{2,5} ]
      [ 0     0        1     A_{3,4}  A_{3,5} ]        b = (0, A_{2,4}, A_{3,4}, 1, 0)^T
      [ 0     0        0        1     A_{4,5} ]
      [ 0     0        0        0        1    ]

- Solving Ax = b by back-substitution: x_5 = 0, x_4 = 1, x_3 = 0, x_2 = 0, x_1 = −A_{1,4}
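A small numerical check of the back-substitution above, filling the unknown A_{i,j} entries with arbitrary values chosen only for illustration:

```python
# Verify that solving Ax = b for the designed upper-triangular A and right-hand side b
# yields x = (-A_{1,4}, 0, 0, 1, 0), regardless of the values of the A_{i,j}.
import numpy as np

rng = np.random.default_rng(5)
Aij = {(i, j): rng.standard_normal() for i in range(1, 5) for j in range(i + 1, 6)}

A = np.eye(5)
for (i, j), v in Aij.items():
    A[i - 1, j - 1] = v
b = np.array([0.0, Aij[(2, 4)], Aij[(3, 4)], 1.0, 0.0])

x = np.linalg.solve(A, b)
print(np.allclose(x, [-Aij[(1, 4)], 0.0, 0.0, 1.0, 0.0]))   # True
```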
21. Outline
- Matrix Product
- Regression
- Low-rank approximation
22. Best Low-Rank Approximation
- For any matrix A and integer k, there is a matrix A_k of rank k that is closest to A among all matrices of rank k
- Since the rank of A_k is k, it is the product CD^T of two k-column matrices C and D
- A_k can be found from the SVD (singular value decomposition) A = UΣV^T, taking C = U_k and D = V_k Σ_k
- This is a good compression of A
  - LSI, PCA, recommendation systems, clustering
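For reference, here is how A_k is obtained offline from the SVD; this is the baseline being approximated, not the streaming algorithm:

```python
# Best rank-k approximation A_k = U_k Sigma_k V_k^T via the SVD.
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((100, 40))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
C = U[:, :k]                        # n x k
D = Vt[:k, :].T * s[:k]             # d x k, i.e. V_k Sigma_k
A_k = C @ D.T                       # best rank-k approximation in Frobenius norm

# The error equals the l2 norm of the discarded singular values (Eckart-Young).
print(np.isclose(np.linalg.norm(A - A_k), np.sqrt(np.sum(s[k:] ** 2))))
```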
23. Best Low-Rank Approximation
- Previously, nothing was known for 1-pass low-rank approximation with relative error
  - Even for k = 1, the best upper bound was O(nd log(nd)) bits
  - Problem 28 of [Mut]: can one get sublinear space?
- We get 1 pass and O(kε^{-2}(n + dε^{-2}) log(nd)) space
- The update time is O(kε^{-4}), so the total work is O(Nkε^{-4}), where N is the number of non-zero entries of A
- A new space lower bound shows this is optimal up to a factor of 1/ε
24. Best Low-Rank Approximation and S^T A
- The sketch S^T A holds information about A
- In particular, there is a rank-k matrix A'_k in the rowspace of S^T A nearly as close to A as the closest rank-k matrix A_k
  - The rowspace of S^T A is the set of linear combinations of its rows
- That is, ‖A − A'_k‖ ≤ (1+ε)‖A − A_k‖
- Why is there such an A'_k?
25. Low-Rank Approximation via Regression
- Apply the regression results with A → A_k and B → A
- The X' minimizing ‖S^T(A_k X − A)‖ has
  - ‖A_k X' − A‖ ≤ (1+ε)‖A_k X* − A‖
- But here X* = I, and X' = (S^T A_k)^− S^T A
- So the matrix A_k X' = A_k(S^T A_k)^− S^T A
  - has rank k
  - is in the rowspace of S^T A
  - is within a factor 1+ε of the smallest distance of any rank-k matrix
26. Low-Rank Approximation in 2 Passes
- Can't use A_k(S^T A_k)^− S^T A without finding A_k
- Instead maintain S^T A
- Can show that if G^T has orthonormal rows, then the best rank-k approximation to A in the rowspace of G^T is (AG)_k G^T, where (AG)_k is the best rank-k approximation to AG
- After the 1st pass, compute an orthonormal basis G^T for the rowspace of S^T A
- In the 2nd pass, maintain AG
- Afterwards, compute (AG)_k and (AG)_k G^T (see the sketch below)
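An illustrative (non-streaming) rendering of these two passes in numpy, with made-up dimensions and sketch sizes; it mirrors the steps above rather than the authors' exact implementation:

```python
# 2-pass low-rank approximation: pass 1 builds S^T A; pass 2 builds AG for an
# orthonormal basis G^T of the rowspace of S^T A; output (AG)_k G^T.
import numpy as np

rng = np.random.default_rng(7)
n, d, k, m = 400, 200, 3, 40
A = (rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
     + 0.01 * rng.standard_normal((n, d)))               # roughly rank-k input

S = rng.choice([-1.0, 1.0], size=(n, m))
StA = S.T @ A                                            # pass 1: sketch

G, _ = np.linalg.qr(StA.T)                               # columns of G: basis of rowspace(S^T A)
AG = A @ G                                               # pass 2: project A

U, s, Vt = np.linalg.svd(AG, full_matrices=False)
AG_k = (U[:, :k] * s[:k]) @ Vt[:k, :]                    # best rank-k approx to AG
A_approx = AG_k @ G.T                                    # output, a rank-k matrix

_, s_full, _ = np.linalg.svd(A, full_matrices=False)
print(np.linalg.norm(A - A_approx), np.sqrt(np.sum(s_full[k:] ** 2)))   # vs best possible
```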
27. Low-Rank Approximation in 2 Passes
- (AG)_k is the best rank-k approximation to AG
- For any rank-k matrix Z,
  - ‖AGG^T − (AG)_k G^T‖ = ‖AG − (AG)_k‖
  - ≤ ‖AG − Z‖
  - = ‖AGG^T − ZG^T‖
- For all Y, (AGG^T − YG^T)(A − AGG^T)^T = 0, so we can apply the Pythagorean theorem twice:
  - ‖A − (AG)_k G^T‖² = ‖A − AGG^T‖² + ‖AGG^T − (AG)_k G^T‖²
  - ≤ ‖A − AGG^T‖² + ‖AGG^T − ZG^T‖²
  - = ‖A − ZG^T‖²
28. 1-Pass Algorithm
- With high probability, (*) ‖AX' − B‖ ≤ (1+ε)‖AX* − B‖, where
  - X* minimizes ‖AX − B‖
  - X' minimizes ‖S^T AX − S^T B‖
- Apply (*) with A → AR and B → A, so X' minimizes ‖S^T(ARX − A)‖
- So X' = (S^T AR)^− S^T A has ‖ARX' − A‖ ≤ (1+ε) min_X ‖ARX − A‖
- The column space of AR contains a (1+ε)-approximation to A_k
- So ‖ARX' − A‖ ≤ (1+ε) min_X ‖ARX − A‖ ≤ (1+ε)² ‖A − A_k‖
- Key idea: ARX' = AR(S^T AR)^− S^T A
  - is easy to compute in 1 pass with small space and fast update time
  - behaves like A (similar to the SVD)
  - can be used instead of A in our 2-pass algorithm!
29. 1-Pass Algorithm
- Algorithm (sketched in code below):
  - Maintain AR and S^T A
  - Compute AR(S^T AR)^− S^T A
  - Let G^T be an orthonormal basis for the rowspace of S^T A, as before
  - Output the best rank-k approximation to AR(S^T AR)^− S^T A in the rowspace of S^T A
- Same as the 2-pass algorithm, except we don't need a second pass to project A onto the rowspace of S^T A
- The analysis is similar to that for regression
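An illustrative rendering of the 1-pass output in numpy; the sketch sizes m_S and m_R are made up, and the code mirrors the steps above rather than the authors' exact implementation:

```python
# 1-pass low-rank approximation: maintain AR and S^T A, form AR (S^T A R)^- S^T A,
# then take its best rank-k approximation within the rowspace of S^T A.
import numpy as np

rng = np.random.default_rng(8)
n, d, k = 400, 200, 3
m_S, m_R = 60, 40                          # illustrative sketch sizes
A = (rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
     + 0.01 * rng.standard_normal((n, d)))

S = rng.choice([-1.0, 1.0], size=(n, m_S))
R = rng.choice([-1.0, 1.0], size=(d, m_R))
AR, StA = A @ R, S.T @ A                   # the only state kept during the single pass

proxy = AR @ np.linalg.pinv(S.T @ AR) @ StA          # AR (S^T A R)^- S^T A, behaves like A
G, _ = np.linalg.qr(StA.T)                           # basis of rowspace(S^T A)
U, s, Vt = np.linalg.svd(proxy @ G, full_matrices=False)
A_approx = (U[:, :k] * s[:k]) @ Vt[:k, :] @ G.T      # rank-k output in rowspace(S^T A)

_, s_full, _ = np.linalg.svd(A, full_matrices=False)
print(np.linalg.norm(A - A_approx), np.sqrt(np.sum(s_full[k:] ** 2)))   # vs best possible
```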
30. A Lower Bound
[Figure: the hard instance. Alice encodes her binary string x into a matrix A whose first k rows are divided into blocks of kε^{-1} columns each, with entry magnitudes growing geometrically from block to block (±10, ±100, ±1000, ±10000, …); the remaining n−k rows are 0s. Bob holds the index i and x_{i+1}, x_{i+2}, ….]
- The error is now dominated by the block of interest
- Bob also inserts a k x k identity submatrix into the block of interest
31. Lower Bound Details
[Figure: the block of interest has kε^{-1} columns and k rows; outside it the entries are 0s, and Bob's scaled identity submatrix P·I_k is placed in the n−k remaining rows beneath the block of interest.]
- Bob inserts a k x k identity submatrix, scaled by a large value P
- Show that any rank-k approximation must err on all of the shaded region
- So a good rank-k approximation likely has the correct sign on Bob's entry
32. Concluding Remarks
- Space bounds are tight for matrix product and regression
  - Sharpen prior upper bounds
  - Prove optimal lower bounds
- Space bounds are off by a factor of ε^{-1} for low-rank approximation
  - First sub-linear (and near-optimal) 1-pass algorithm
  - We have better upper bounds for restricted cases
- Improve the dependence on ε in the update time
- Lower bounds for multi-pass algorithms?