Title: Randomized Algorithms for Linear Algebraic Computations
1. Randomized Algorithms for Linear Algebraic Computations & Applications
Petros Drineas, Rensselaer Polytechnic Institute, Computer Science Department
To access my web page: drineas
2. Matrices and network applications
Matrices represent networks. Graphs are often used in network applications (e.g., network traffic monitoring, web structure analysis, social network mining, protein interaction networks, etc.), and graphs are represented by matrices (e.g., adjacency matrices, edge-incidence matrices, etc.). More recently, time-evolving graphs have been represented by tensors (multi-mode arrays) in the literature. Goal: discover patterns and anomalies in spite of the high dimensionality of the data.
3. Matrices and network applications
Matrices represent networks. Graphs are often used in network applications (e.g., network traffic monitoring, web structure analysis, social network mining, protein interaction networks, etc.), and graphs are represented by matrices (e.g., adjacency matrices, edge-incidence matrices, etc.). More recently, time-evolving graphs have been represented by tensors (multi-mode arrays) in the literature. Goal: discover patterns and anomalies in spite of the high dimensionality of the data. Linear algebra and numerical analysis provide the fundamental mathematical and algorithmic tools to deal with matrix and tensor computations.
4. Randomized algorithms
Randomization and sampling allow us to design provably accurate algorithms for problems that are:
- Massive (e.g., matrices (graphs) so large that they cannot be stored at all, or can only be stored in slow, secondary memory devices)
- Computationally expensive or even NP-hard (e.g., combinatorial problems such as the Column Subset Selection Problem)
5. Example: monitoring IP flows
A network administrator monitors the (source, destination) IP flows over time in an m-sources-by-n-destinations matrix, where Aij is the count of exchanged flows between the i-th source and the j-th destination.
Tasks:
- Find patterns, summaries, anomalies, etc. in the static setting, or
- in the dynamic setting (tensor representation).
6. Interplay
Applications: data mining and information retrieval (e.g., human genetics, internet data, electronic circuit testing data, etc.)
Theoretical Computer Science: randomized and approximation algorithms
Numerical Linear Algebra: matrix computations and linear algebra (i.e., perturbation theory)
7. Students and collaborators
Students:
- Asif Javed (PhD, graduated)
- Christos Boutsidis (PhD, 3rd year)
- Jamey Lewis (PhD, 2nd year)
- Elena Sebe (undergrad, now at RPI)
- John Payne (undergrad, now at CMU)
- Richard Alimi (undergrad, now at Yale)
Collaborators:
- M.W. Mahoney (Stanford U)
- J. Sun (IBM T.J. Watson Research Center)
- S. Muthu (Google)
- E. Ziv (UCSF, School of Medicine)
- K. K. Kidd (Yale U, Dept. of Genetics)
- P. Paschou (Democritus U., Greece, Genetics)
- Our group keeps close ties with
- Yahoo! Research
- Sandia National Laboratories
- Funding from NSF and Yahoo! Research.
8. Overview
- From the Singular Value Decomposition (SVD) to CUR-type decompositions
- Additive and relative error CUR-type decompositions
- Future directions
9. The Singular Value Decomposition (SVD)
Matrix rows are points (vectors) in a Euclidean space: the m objects are rows and the n features are columns. E.g., given 2 objects (x and d), each described with respect to two features, we get a 2-by-2 matrix. Two objects are close if the angle between their corresponding vectors is small.
10. SVD: intuition
Let the blue circles represent m data points in a 2-D Euclidean space. Then, the SVD of the m-by-2 matrix of the data will return:
11. Singular values
σ1 measures how much of the data variance is explained by the first singular vector. σ2 measures how much of the data variance is explained by the second singular vector.
12. SVD: formal definition
ρ: rank of A. U (V): orthogonal matrix containing the left (right) singular vectors of A. Σ: diagonal matrix containing the singular values of A.
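In formulas (standard SVD notation, using the symbols defined above): A = U Σ VT, where U is m × ρ, Σ = diag(σ1, …, σρ) with σ1 ≥ σ2 ≥ … ≥ σρ > 0, and V is n × ρ.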
13. Rank-k approximations via the SVD
[Figure: the decomposition A = U Σ VT, with rows indexed by objects and columns by features; the leading singular directions capture the significant part of the data, the trailing ones the noise.]
14. Rank-k approximations (Ak)
Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
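As a quick illustration of this construction, a minimal numpy sketch (the matrix and the values of k below are arbitrary placeholders): it forms Ak directly from the truncated SVD and shows the residual shrinking as k grows.

import numpy as np

def rank_k_approximation(A, k):
    # Best rank-k approximation A_k = U_k Sigma_k V_k^T via the truncated SVD.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The residual ||A - A_k||_F shrinks as k grows.
A = np.random.rand(100, 50)
for k in (1, 5, 10):
    print(k, np.linalg.norm(A - rank_k_approximation(A, k), 'fro'))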
15. PCA and SVD
Principal Components Analysis (PCA) essentially amounts to the computation of the Singular Value Decomposition (SVD) of a covariance matrix. "SVD is the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra." (Dianne O'Leary, MMDS 06)
16. Uk solves an optimization problem
Given an m × n matrix A, we seek an m-by-k matrix C that minimizes the residual Frobenius norm:
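One standard way to write this problem (with C restricted to have orthonormal columns, so that CCTA is the projection of A onto the span of C's columns): minimize ||A − CCTA||_F over all m-by-k matrices C with CTC = Ik.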
- (1) It turns out that C = Uk is (one of many) solutions to the above problem.
- (2) The minimal residual error is equal to the Frobenius norm of A − Ak.
- (3) The above observations hold for any unitarily invariant norm (e.g., the spectral norm).
- Notation: P_C A is the projection of A on the subspace spanned by the columns of C.
17. SVD issues
SVD and PCA are often used to summarize and approximate matrices, and they have enjoyed enormous success in data analysis. BUT...
18. SVD issues
SVD and PCA are often used to summarize and approximate matrices, and they have enjoyed enormous success in data analysis. BUT:
- For large sparse graphs they require large amounts of memory, exactly because the resulting matrices are not sparse any more. (Common networks such as the web, the Internet topology graph, and the who-trusts-whom social network are all large and sparse.)
- Running time becomes an issue.
- Assigning meaning to the singular vectors (reification) is tricky.
19. Alternative: CUR-type decompositions
- A sketch consisting of a few rows/columns of the matrix may replace the SVD.
- Rows/columns are drawn randomly, using various importance sampling schemes.
- The choice of the sampling probabilities is critical for the accuracy of the approximation.
20. Advantages
In applications where large, sparse matrices appear, we can design algorithms for CUR-type decompositions that:
- are computable after two passes (sequential reads) through the matrices,
- require O(m+n) RAM space (compare to O(mn) for the SVD),
- can be computed very fast (extra O(m+n) time after the two passes), and
- preserve sparsity.
21. Advantages
In applications where large, sparse matrices appear, we can design algorithms for CUR-type decompositions that:
- are computable after two passes (sequential reads) through the matrices,
- require O(m+n) RAM space (compare to O(mn) for the SVD),
- can be computed very fast (extra O(m+n) time after the two passes), and
- preserve sparsity.
Caveat: accuracy loss.
22. Some notation
Given an m × n matrix A: (i) A^(i) denotes the i-th column of the matrix as a column vector, (ii) A_(i) denotes the i-th row of the matrix as a row vector, (iii) |A_(i)| or |A^(i)| denotes the Euclidean norm of the corresponding row or column, and (iv) ||A||_F denotes the Frobenius norm of A.
Finally, ε will be an accuracy parameter in (0,1) and δ a failure probability.
23. Computing CUR (Drineas & Kannan SODA 03; Drineas, Kannan, Mahoney SICOMP 06)
C consists of c = O(1/ε²) columns of A and R consists of r = O(1/ε²) rows of A.
24. Computing CUR (Drineas & Kannan SODA 03; Drineas, Kannan, Mahoney SICOMP 06)
C consists of c = O(1/ε²) columns of A and R consists of r = O(1/ε²) rows of A.
25. Computing U
Intuition:
- The CUR algorithm essentially expresses every row of the matrix A as a linear combination of a small subset of the rows of A.
- This small subset consists of the rows in R.
26. Computing U
Intuition:
- The CUR algorithm essentially expresses every row of the matrix A as a linear combination of a small subset of the rows of A.
- This small subset consists of the rows in R.
Given a row of A, say A_(i), the algorithm computes a good fit for the row A_(i) using the rows in R as the basis, by approximately solving:
Notice that only c = O(1) elements of the i-th row are given as input. However, a vector of coefficients u can still be computed.
27. Computing U
Given c elements of A_(i), the algorithm computes a good fit for the row A_(i) using the rows in R as the basis, by approximately solving:
In the process of computing U, we fix its rank to be a positive constant k, which is part of the input.
28. Computing U
Given c elements of A_(i), the algorithm computes a good fit for the row A_(i) using the rows in R as the basis, by approximately solving:
In the process of computing U, we fix its rank to be a positive constant k, which is part of the input.
Note: Since CUR has rank at most k, ||A − CUR||_{2,F} ≥ ||A − Ak||_{2,F}. Thus, we should choose a k such that ||A − Ak||_{2,F} is small.
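A minimal numpy sketch of this fitting step (simplified: it fits the full row by ordinary least squares, rather than only the c sampled entries as the actual algorithm does; the matrix, the sampled row set, and the row index are arbitrary placeholders):

import numpy as np

def fit_row(a_i, R):
    # Coefficients u such that u @ R approximates the row a_i,
    # i.e., solve min_u || a_i - u R ||_2 (least squares on R^T).
    u, *_ = np.linalg.lstsq(R.T, a_i, rcond=None)
    return u

rng = np.random.default_rng(0)
A = rng.random((200, 50))
R = A[rng.choice(200, size=10, replace=False), :]  # a small subset of the rows of A
u = fit_row(A[0], R)
print(np.linalg.norm(A[0] - u @ R))                # reconstruction error for row 0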
29. Error bounds (Frobenius norm)
Assume Ak is the best rank-k approximation to A (through the SVD). Then the bound below holds; we need to pick O(k/ε²) rows and O(k/ε²) columns.
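The bound has the standard additive-error form of the cited papers (restated here; ε is the accuracy parameter from the notation slide): ||A − CUR||_F ≤ ||A − Ak||_F + ε ||A||_F.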
30. Error bounds (2-norm)
Assume Ak is the best rank-k approximation to A (through the SVD). Then the bound below holds; we need to pick O(1/ε²) rows and O(1/ε²) columns and set k = (1/ε) − 1.
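The spectral-norm bound has the analogous additive-error form (restated here): ||A − CUR||_2 ≤ ||A − Ak||_2 + ε ||A||_F.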
31. Application to network flow data (Sun, Xie, Zhang, Faloutsos SDM 07)
The data:
- A traffic trace consisting of TCP flow records collected at the backbone router of a class-B university network.
- Each record in the trace corresponds to a directional TCP flow between two hosts; timestamps indicate when the flow started and finished.
- Approx. 22,000 hosts/destinations; approx. 800,000 packets per hour between all host-destination pairs.
32. Application to network flow data (Sun, Xie, Zhang, Faloutsos SDM 07)
The data:
- A traffic trace consisting of TCP flow records collected at the backbone router of a class-B university network.
- Each record in the trace corresponds to a directional TCP flow between two hosts; timestamps indicate when the flow started and finished.
- Approx. 22,000 hosts/destinations; approx. 800,000 packets per hour between all host-destination pairs.
Data from 10am-11am on January 6, 2005.
33. Application to network flow data (Sun, Xie, Zhang, Faloutsos SDM 07)
Goal:
- Detect abnormal (anomalous) hosts by measuring the reconstruction error for each host!
- In other words, if a host cannot be accurately expressed as a linear combination of a small set of other hosts, then it potentially represents an anomaly and should be flagged.
34. Application to network flow data (Sun, Xie, Zhang, Faloutsos SDM 07)
Goal:
- Detect abnormal (anomalous) hosts by measuring the reconstruction error for each host!
- In other words, if a host cannot be accurately expressed as a linear combination of a small set of other hosts, then it potentially represents an anomaly and should be flagged.
Simulation experiments:
- Starting with real flow matrices, abnormalities are injected in the data by inserting:
  - Abnormal source hosts: a source host is randomly selected and all the corresponding row entries are set to 1 (a scanner host that sends flows to every other destination in the network).
  - Abnormal destination hosts: a destination is randomly picked, and 90% of its corresponding column entries are set to 1 (a denial-of-service attack from a large number of hosts).
35. Application to network flow data (Sun, Xie, Zhang, Faloutsos SDM 07)
Goal:
- Detect abnormal (anomalous) hosts by measuring the reconstruction error for each host!
- In other words, if a host cannot be accurately expressed as a linear combination of a small set of other hosts, then it potentially represents an anomaly and should be flagged.
Simulation experiments:
- Starting with real flow matrices, abnormalities are injected in the data by inserting:
  - Abnormal source hosts: a source host is randomly selected and all the corresponding row entries are set to 1 (a scanner host that sends flows to every other destination in the network).
  - Abnormal destination hosts: a destination is randomly picked, and 90% of its corresponding column entries are set to 1 (a denial-of-service attack from a large number of hosts).
Results: By applying a variant of CUR (removing duplicate rows/columns) and by keeping about 500 columns and rows, recall is close to 100% and precision is close to 97%.
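A minimal sketch of the reconstruction-error idea (assumptions: A is the source-by-destination flow matrix and A_cur is any CUR, or other low-rank, approximation of it; the number of flagged hosts is an arbitrary placeholder):

import numpy as np

def flag_anomalous_sources(A, A_cur, num_flags=10):
    # Per-row (per-source-host) reconstruction error of the approximation.
    errors = np.linalg.norm(A - A_cur, axis=1)
    # The hosts that are reconstructed worst are flagged as potential anomalies.
    flagged = np.argsort(errors)[-num_flags:]
    return flagged, errors

Destination-host anomalies can be flagged the same way using column errors (axis=0).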
36. More accurate CUR decompositions?
An alternative perspective:
Q. Can we find the best set of columns and rows to include in C and R (together with the optimal U for that choice)? Randomized and/or deterministic strategies are acceptable, even at the loss of efficiency (running time, memory, and sparsity).
37. More accurate CUR decompositions?
An alternative perspective:
Q. Can we find the best set of columns and rows to include in C and R (together with the optimal U for that choice)? Randomized and/or deterministic strategies are acceptable, even at the loss of efficiency (running time, memory, and sparsity).
Prior results by E. E. Tyrtyshnikov and collaborators (LAA 97) implied rather weak error bounds if we choose the columns and rows of A that define a parallelepiped of maximal volume.
38. Relative-error CUR-type decompositions
For any matrix A, we can find C, U, and R such that the norm of A − CUR is almost equal to the norm of A − Ak.
39. From SVD to relative error CUR
Exploit structural properties of CUR to analyze data (m objects, n features). Instead of reifying the Principal Components:
- Use PCA (a.k.a. SVD) to find how many Principal Components are needed to explain the data.
- Run CUR and pick columns/rows instead of eigen-columns and eigen-rows!
- Assign meaning to actual columns/rows of the matrix! Much more intuitive!
40. From SVD to relative error CUR
Exploit structural properties of CUR to analyze data (m objects, n features). Instead of reifying the Principal Components:
- Use PCA (a.k.a. SVD) to find how many Principal Components are needed to explain the data.
- Run CUR and pick columns/rows instead of eigen-columns and eigen-rows!
- Assign meaning to actual columns/rows of the matrix! Much more intuitive!
Caveat: Relative-error CUR-type decompositions retain sparsity, but necessitate the same amount of time as the computation of Ak and O(mn) memory.
41. Theorem: relative error CUR (Drineas, Mahoney, Muthukrishnan SIMAX 08)
For any k, O(SVDk(A)) time suffices to construct C, U, and R such that the bound below holds with probability at least 0.7, by picking O( k log k / ε² ) columns and O( k log²k / ε⁶ ) rows.
O(SVDk(A)): the time to compute the top k left/right singular vectors and values of A.
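The bound is the relative-error guarantee of the cited theorem (restated here, consistent with the (1+ε) entry in the summary on slide 54): ||A − CUR||_F ≤ (1 + ε) ||A − Ak||_F.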
42. Comparison with additive error CUR
Let Ak be the best rank-k approximation to A. Then, after two passes through A, we can pick O(k/ε⁴) rows and O(k/ε⁴) columns such that the additive-error bound ||A − CUR||_F ≤ ||A − Ak||_F + ε ||A||_F holds (compare with slide 29).
43. A constant factor CUR construction (Mahoney and Drineas, PNAS, to appear)
For any k, O(SVDk(A)) time suffices to construct C, U, and R such that the bound below holds with probability at least 0.7, by picking O( k log k / ε² ) columns and O( k log k / ε² ) rows.
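The bound is the constant-factor guarantee of the cited construction (restated here, consistent with the (2+ε) entry in the summary on slide 54): ||A − CUR||_F ≤ (2 + ε) ||A − Ak||_F.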
44. Relative-error CUR decomposition
Create an approximation to A using rows and columns of A.
Goal: Provide very good bounds for some norm of A − CUR.
- How do we draw the columns and rows of A to include in C and R?
- How do we construct U?
45. Step 1: subspace sampling for C
- INPUT: matrix A, rank parameter k, number of columns c
- OUTPUT: matrix of selected columns C
- Compute the probabilities pj
- For each j = 1, 2, …, n, pick the j-th column of A with probability min{1, c·pj}
- Let C be the matrix containing the sampled columns (C has c columns in expectation)
46. Subspace sampling (Frobenius norm)
Vk: orthogonal matrix containing the top k right singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
Remark: The rows of VkT are orthonormal vectors, but its columns (VkT)^(j) are not.
Subspace sampling: pj = ||(VkT)^(j)||² / k (the normalization is chosen s.t. the pj sum up to 1).
47. Subspace sampling (Frobenius norm)
Vk: orthogonal matrix containing the top k right singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
Remark: The rows of VkT are orthonormal vectors, but its columns (VkT)^(j) are not.
Subspace sampling: pj = ||(VkT)^(j)||² / k (the normalization is chosen s.t. the pj sum up to 1). These pj are leverage scores (many references in the statistics community).
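A minimal numpy sketch of this sampling scheme (it computes an exact SVD to obtain Vk, which is exactly the expensive step discussed later; the matrix, k, and c are arbitrary placeholders):

import numpy as np

def column_leverage_scores(A, k):
    # p_j = ||(V_k^T)^{(j)}||^2 / k, where V_k^T holds the top-k right singular vectors.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k, :] ** 2, axis=0) / k      # length-n vector that sums to 1

def subspace_sample_columns(A, k, c, seed=0):
    # Keep column j independently with probability min(1, c * p_j).
    p = column_leverage_scores(A, k)
    keep = np.random.default_rng(seed).random(A.shape[1]) < np.minimum(1.0, c * p)
    return A[:, keep]

A = np.random.rand(500, 80)
C = subspace_sample_columns(A, k=5, c=20)
print(C.shape)  # at most c columns in expectation

Row sampling for R is identical, with Uk in place of Vk (see the next two slides).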
48. Step 1: subspace sampling for R
- INPUT: matrix A, rank parameter k, number of rows r
- OUTPUT: matrix of selected rows R
- Compute the probabilities pi
- For each i = 1, 2, …, m, pick the i-th row of A with probability min{1, r·pi}.
- Let R be the matrix containing the sampled rows. (R has r rows in expectation)
49. Subspace sampling (Frobenius norm)
Uk: orthogonal matrix containing the top k left singular vectors of A. Σk: diagonal matrix containing the top k singular values of A.
Remark: The columns of Uk are orthonormal vectors, but its rows (Uk)_(i) are not.
Subspace sampling: pi = ||(Uk)_(i)||² / k (the normalization is chosen s.t. the pi sum up to 1). These pi are leverage scores (many references in the statistics community).
50. Algorithm CUR
[Figure: row leverage scores and ColumnSelect on AT pick rows 10, 33, and 42; column leverage scores and ColumnSelect on A pick columns 7, 14, and 43; U = C⁺AR⁺.]
Algorithm CUR: C contains columns 7, 14, and 43 of A; R contains rows 10, 33, and 42 of A; U = C⁺AR⁺ (c = r = O(k log(k)/ε²) in the worst case).
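A minimal numpy sketch of assembling the decomposition once the columns and rows have been selected (the index sets are the ones shown on this slide; the test matrix is an arbitrary placeholder, and C⁺, R⁺ denote Moore-Penrose pseudoinverses):

import numpy as np

def cur_from_selected(A, col_idx, row_idx):
    # C: selected columns of A, R: selected rows of A, U = C^+ A R^+.
    C = A[:, col_idx]
    R = A[row_idx, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

A = np.random.rand(60, 50)
C, U, R = cur_from_selected(A, col_idx=[7, 14, 43], row_idx=[10, 33, 42])
print(np.linalg.norm(A - C @ U @ R, 'fro'))  # Frobenius error of the CUR approximation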
51. Computing subspace probabilities faster
Subspace sampling (expensive):
Open problem: Is it possible to compute/approximate the subspace sampling probabilities faster?
52. Computing subspace probabilities faster
Subspace sampling (expensive):
Open problem: Is it possible to compute/approximate the subspace sampling probabilities faster?
Partial answer: Assuming k is O(1), it is known that by leveraging the Fast Johnson-Lindenstrauss transform of (Ailon & Chazelle STOC 06) the singular vectors of a matrix can be approximated very accurately in O(mn) time. (Sarlos FOCS 06; Drineas, Mahoney, Muthukrishnan, Sarlos 07; as well as work by Tygert, Rokhlin, and collaborators in PNAS)
53. Computing subspace probabilities faster
Subspace sampling (expensive):
Open problem: Is it possible to compute/approximate the subspace sampling probabilities faster?
Partial answer: Assuming k is O(1), it is known that by leveraging the Fast Johnson-Lindenstrauss transform of (Ailon & Chazelle STOC 06) the singular vectors of a matrix can be approximated very accurately in O(mn) time. (Sarlos FOCS 06; Drineas, Mahoney, Muthukrishnan, Sarlos 07; as well as work by Tygert, Rokhlin, and collaborators in PNAS)
Problem: To the best of my knowledge, this does not immediately imply an algorithm to (provably) approximate the subspace sampling probabilities.
54. CUR decompositions: a summary
- G.W. Stewart (Num. Math. 99, TR 04): C via a variant of the QR algorithm; R via a variant of the QR algorithm; U minimizes ||A − CUR||_F. No a priori bounds; solid experimental performance.
- Goreinov, Tyrtyshnikov, Zamarashkin (LAA 97, Cont. Math. 01): C: columns that span max volume; U = W⁺; R: rows that span max volume. Existential result; error bounds depend on ||W⁺||_2. Spectral norm bounds!
- Williams & Seeger (NIPS 01): C: uniformly at random; U = W⁺; R: uniformly at random. Experimental evaluation; A is assumed PSD; connections to the Nyström method.
- Drineas and Kannan (SODA 03, SICOMP 06): C w.r.t. column lengths; U in linear/constant time; R w.r.t. row lengths. Randomized algorithm; provable, a priori bounds; explicit dependency on ||A − Ak||.
- Drineas, Mahoney, Muthukrishnan (SODA 06, SIMAX 08): C depends on singular vectors of A; U: (almost) W⁺; R depends on singular vectors of C. (1+ε) approximation to ||A − Ak||; computable in O(mnk²) time (experimentally).
- Mahoney & Drineas (PNAS, to appear): C depends on singular vectors of A; U = C⁺AR⁺; R depends on singular vectors of A. (2+ε) approximation to ||A − Ak||; computable in O(mnk²) time (experimentally).
55. Future directions
- Faster algorithms
  - Leveraging preprocessing tools (i.e., the Fast Johnson-Lindenstrauss Transform, Ailon-Chazelle 06)
- Deterministic variants
  - Upcoming work by C. Boutsidis
- Matching lower bounds
- Implementations and widely disseminated software
- Similar results for non-linear dimensionality reduction techniques: kernel-PCA.