Title: Beyond Streams and Graphs: Dynamic Tensor Analysis
1. Beyond Streams and Graphs: Dynamic Tensor Analysis
Jimeng Sun, Dacheng Tao, Christos Faloutsos
Speaker: Jimeng Sun
2. Motivation
- Goal: incremental pattern discovery on streaming applications
  - Streams
    - E1: Environmental sensor networks
    - E2: Cluster / data center monitoring
  - Graphs
    - E3: Social network analysis
  - Tensors
    - E4: Network forensics
    - E5: Financial auditing
    - E6: fMRI brain image analysis
- How to summarize streaming data effectively and incrementally?
3. E3: Social network analysis
- Traditionally, people focus on static networks and find community structures
- We plan to monitor the change of the community structure over time and identify abnormal individuals
4. E4: Network forensics
- Directional network flows
- A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets 2004]
  - 450 GB/hour with compression
- Task: identify abnormal traffic patterns and find out the cause
[Figure: source x destination traffic matrix, with abnormal traffic standing out from normal traffic]
- Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
5. Static data model
- For a timestamp, the stream measurements can be modeled using a tensor
- Dimension: a single stream
  - E.g., <Christos, graph>
- Mode: a group of dimensions of the same kind
  - E.g., Source, Destination, Port
[Figure: a source x destination matrix at time 0]
6. Static data model (cont.)
- Tensor
  - Formally, X ∈ R^(N1 × N2 × ... × NM)
  - A generalization of matrices
  - Represented as a multi-array or data cube

Order          | 1st    | 2nd    | 3rd
Correspondence | Vector | Matrix | 3D array
7. Dynamic data model (our focus)
- Streams come with structure
  - (time, source, destination, port)
  - (time, author, keyword)
[Figure: a sequence of source x destination matrices arriving over time]
8. Dynamic data model (cont.)
- Tensor streams
  - A sequence of Mth-order tensors

Order          | 1st              | 2nd                  | 3rd
Correspondence | Multiple streams | Time-evolving graphs | 3D arrays

[Figure: a sequence of author x keyword tensors over time]
9. Dynamic tensor analysis
[Figure: old (source x destination) tensors plus a new tensor are summarized incrementally into projection matrices U_Source and U_Destination and the old core tensors]
10. Roadmap
- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion
11. Background: Singular value decomposition (SVD)
- SVD
  - Best rank-k approximation in L2
- PCA is an important application of SVD
- A ≈ U Σ Vᵀ, with A of size m×n, U of size m×k, Σ of size k×k, and Vᵀ of size k×n; the projection is Y = A V
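As a concrete illustration of the rank-k approximation above, here is a minimal numpy sketch (not from the original slides; the variable names are illustrative):

import numpy as np

# Best rank-k approximation of A in the least-squares (L2 / Frobenius) sense.
def best_rank_k(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# PCA-style projection onto the top-k right singular vectors: Y = A V_k
A = np.random.randn(100, 20)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Y = A @ Vt[:3].T                                        # k = 3 projection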
12. Latent semantic indexing (LSI)
- Singular vectors are useful for clustering or correlation detection
[Figure: the document-term matrix factors into (document-concept) x (concept-association) x (concept-term); terms such as "query", "pattern", "frequent", "cluster", and "cache" group into DB and DM concepts]
13. Tensor operation: Matricize X(d)
- Unfold a tensor into a matrix
- Acknowledgment to Tammy Kolda for this slide
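A minimal numpy sketch of mode-d matricization, shown under one common unfolding convention (the slide's exact column ordering may differ):

import numpy as np

def matricize(X, d):
    # Unfold tensor X along mode d: rows are the mode-d fibers,
    # giving an (N_d x prod of the other N_i) matrix.
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

X = np.arange(24).reshape(2, 3, 4)   # a small 3rd-order tensor
X0 = matricize(X, 0)                 # shape (2, 12)
X1 = matricize(X, 1)                 # shape (3, 8)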
14. Tensor operation: Mode-product
- Multiply a tensor with a matrix along one mode
[Figure: a (source x destination x port) tensor multiplied along the source mode by a (source x group) matrix yields a (group x destination x port) tensor]
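A numpy sketch of the mode product just illustrated; the shapes and the grouping matrix are illustrative assumptions, not data from the slides:

import numpy as np

def mode_product(X, U, d):
    # Mode-d product X x_d U: contract mode d of X with the columns of U,
    # then move the new axis back into position d.
    Y = np.tensordot(U, X, axes=([1], [d]))
    return np.moveaxis(Y, 0, d)

X = np.random.rand(500, 500, 100)    # source x destination x port
U = np.random.rand(10, 500)          # group x source
Y = mode_product(X, U, 0)            # shape (10, 500, 100): group x destination x port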
15. Related work
- Low-rank approximation: PCA, SVD (orthogonal projection based)
- Multilinear analysis: tensor decompositions (Tucker, PARAFAC, HOSVD)
- Stream mining: scan data once to identify patterns; sampling [Vitter 85, Gibbons 98]; sketches [Indyk 00, Cormode 03]
- Graph mining: explorative [Faloutsos 04, Kumar 99, Leskovec 05]; algorithmic [Yan 05, Cormode 05]
- Our work sits at the intersection of these areas
16. Roadmap
- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion
17. Tensor analysis
- Given a sequence of tensors X_1, ..., X_n
- find the projection matrices U_1, ..., U_M
- such that the reconstruction error e is minimized
- Note that this is a generalization of PCA when n is constant
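The error itself was shown as a figure on the slide; a reconstruction of the objective in the notation above, following the form used in the DTA paper, is

e = \sum_{t=1}^{n} \left\| \mathcal{X}_t - \mathcal{X}_t \times_1 (U_1 U_1^\top) \times_2 \cdots \times_M (U_M U_M^\top) \right\|_F^2

i.e., the total squared Frobenius error after projecting each tensor onto the subspaces spanned by the U_d and reconstructing it.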
18. Why do we care?
- Anomaly detection
  - Reconstruction error driven
  - Multiple resolution
- Multiway latent semantic indexing (LSI)
[Figure: multiway LSI groups authors (e.g., Michael Stonebraker, Philip Yu) and keywords (e.g., "Query", "Pattern") into concepts that shift over time]
19. 1st order DTA - problem
- Given x_1, ..., x_n, where each x_i ∈ R^N, find U ∈ R^(N×R) such that the error e is small
- Note that Y = XU
[Figure: the n×N matrix X of sensor streams (e.g., indoor and outdoor sensors over time) is approximated by Y Uᵀ, with Y = XU the R-dimensional projection]
20. 1st order DTA
- Input: new data vector x ∈ R^N, old variance matrix C ∈ R^(N×N)
- Output: new projection matrix U ∈ R^(N×R)
- Algorithm
  1. Update the variance matrix: C_new = xᵀx + C
  2. Diagonalize: U Λ Uᵀ = C_new
  3. Determine the rank R and return U
- Diagonalization has to be done for every new x!
[Figure: the new row x is appended to the old matrix X; C_new = xᵀx + C is re-diagonalized into U Λ Uᵀ]
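A minimal numpy sketch of this 1st-order DTA update (a paraphrase of the three steps above, not the authors' code; x is treated as a row vector):

import numpy as np

def dta_update_1st_order(C, x, R):
    # 1. Update the variance matrix: C_new = x^T x + C
    C_new = C + np.outer(x, x)
    # 2. Diagonalize C_new = U Lambda U^T
    eigvals, eigvecs = np.linalg.eigh(C_new)
    order = np.argsort(eigvals)[::-1]          # eigenvalues in decreasing order
    # 3. Keep the top-R eigenvectors as the new projection matrix
    U = eigvecs[:, order[:R]]
    return U, C_new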
21. 1st order STA
- Adjust U smoothly when new data arrive, without diagonalization [VLDB 2005]
- For each new point x
  - Project onto the current line
  - Estimate the error
  - Rotate the line in the direction of the error, in proportion to its magnitude
- For each new point x and for i = 1, ..., k
  - y_i = U_iᵀ x (projection onto U_i)
  - d_i ← λ d_i + y_i² (energy ≈ i-th eigenvalue)
  - e_i = x − y_i U_i (error)
  - U_i ← U_i + (1/d_i) y_i e_i (update estimate)
  - x ← x − y_i U_i (repeat with the remainder)
[Figure: in the Sensor 1 vs. Sensor 2 plane, U rotates toward the new point in proportion to the error]
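A numpy sketch of the update rules listed above (the forgetting factor lam and the variable names are illustrative, not the authors' code):

import numpy as np

def sta_update_1st_order(U, d, x, lam=0.98):
    # U: (N, k) current projection matrix, one tracked direction per column
    # d: (k,) energy estimates (approximate eigenvalues)
    # x: (N,) new data point
    x = x.astype(float).copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x                      # y_i = U_i^T x  (projection onto U_i)
        d[i] = lam * d[i] + y ** 2           # d_i <- lam * d_i + y_i^2  (energy)
        e = x - y * U[:, i]                  # e_i = x - y_i U_i  (error)
        U[:, i] = U[:, i] + (y / d[i]) * e   # rotate U_i toward the error
        x = x - y * U[:, i]                  # deflate; continue with the remainder
    return U, d

The Mth-order STA (slide 24) amounts to running this update along each mode of the matricized tensor.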
22. Mth order DTA
23. Mth order DTA complexity
- Storage
  - O(∏ N_i), i.e., the size of one input tensor at a single timestamp
- Computation
  - Σ N_i³ (or Σ N_i²) for diagonalizing the variance matrices C_d
  - Σ N_i · ∏ N_i for the matrix multiplications X_(d)ᵀ X_(d)
- For low-order tensors (order < 3), diagonalization is the main cost
- For high-order tensors, matrix multiplication is the main cost
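A simplified numpy sketch of the per-timestamp Mth-order DTA step implied by this cost breakdown (one variance matrix per mode, updated with the matricized tensor and re-diagonalized); the paper keeps the variance matrices in factored form, so treat this as an illustration rather than the exact algorithm:

import numpy as np

def matricize(X, d):
    # Mode-d unfolding: an (N_d x prod of the other N_i) matrix.
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

def dta_update(X, Cs, ranks, lam=1.0):
    # X: new data tensor of order M; Cs[d]: (N_d, N_d) variance matrix for mode d
    Us = []
    for d in range(X.ndim):
        Xd = matricize(X, d)
        Cs[d] = lam * Cs[d] + Xd @ Xd.T            # mode-d covariance update
        eigvals, eigvecs = np.linalg.eigh(Cs[d])   # diagonalization: the N_d^3 term
        order = np.argsort(eigvals)[::-1]
        Us.append(eigvecs[:, order[:ranks[d]]])    # top-R_d eigenvectors for mode d
    return Us, Cs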
24. Mth order STA
- Run the 1st order STA along each mode
- Complexity
  - Storage: O(∏ N_i)
  - Computation: Σ R_i · ∏ N_i, which is smaller than DTA
25. Roadmap
- Motivation and main ideas
- Background and related work
- Dynamic and streaming tensor analysis
- Experiments
- Conclusion
26. Experiments
- Objectives
  - Computational efficiency
  - Accurate approximation
  - Real applications
    - Anomaly detection
    - Clustering
27. Data set 1: Network data
- TCP flows collected at the CMU backbone
- Raw data: 500 GB with compression
- Construct 3rd-order tensors over hourly windows with <source, destination, port>
- Each tensor: 500×500×100
- 1200 timestamps (hours)
[Figure: entry values from 10AM to 11AM on 01/06/2005; the data are sparse with a power-law distribution]
28. Data set 2: Bibliographic data (DBLP)
- Papers from the VLDB and KDD conferences
- Construct 2nd-order tensors over yearly windows with <author, keyword>
- Each tensor: 4584×3741
- 11 timestamps (years)
29. Computational cost
[Figures: CPU time for the 3rd-order network tensor and the 2nd-order DBLP tensor]
- OTA is the offline tensor analysis
- Performance metric: CPU time (sec)
- Observations
  - DTA and STA are orders of magnitude faster than OTA
  - The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time)
30. Accuracy comparison
[Figures: error ratio for the 3rd-order network tensor and the 2nd-order DBLP tensor]
- Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA at 20%
- Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes
31. Network anomaly detection
- The reconstruction error gives an indication of anomalies
- The prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin)
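A minimal sketch of how reconstruction-error-driven flagging could look in practice (the threshold rule is an illustrative assumption, not the paper's exact procedure):

import numpy as np

def flag_anomalies(errors, num_std=2.0):
    # errors: reconstruction error per timestamp; flag hours whose error
    # is far above the average.
    errors = np.asarray(errors, dtype=float)
    threshold = errors.mean() + num_std * errors.std()
    return np.where(errors > threshold)[0]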
32. Multiway LSI

Authors                                                                   | Keywords                                                   | Year | Group
michael carey, michael stonebraker, h. jagadish, hector garcia-molina    | queri, parallel, optimization, concurr, objectorient      | 1995 | DB
surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel | distribut, systems, view, storage, servic, process, cache | 2004 | DB
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal     | streams, pattern, support, cluster, index, gener, queri   | 2004 | DM

- Two groups are correctly identified: Databases (DB) and Data mining (DM)
- People and concepts are drifting over time
33. Conclusion
- The tensor stream is a general data model
- DTA/STA incrementally decompose tensors into core tensors and projection matrices
- The results of DTA/STA can be used in other applications
  - Anomaly detection
  - Multiway LSI
34. Final word: Think structurally!
- The world is not flat, and neither should data mining be.

Contact: Jimeng Sun (jimeng_at_cs.cmu.edu)