Title: Naftali Tishby
Slide 1: Algebraic and Information Theoretic Methods for Network Anomaly Detection
- Naftali Tishby
- School of Computer Science and Engineering
- Hebrew University of Jerusalem
- tishby@cs.huji.ac.il
- http://www.cs.huji.ac.il/tishby
NATO ASI School, Villa Cagnola, Gazzada, Italy
Slide 2: Outline
- Statement of the problem
- Signals, observables, correlations, and graphs
- Common security issues
- Topology: critical nodes, connected groups
- Social: find his friends; how are they linked?
- Temporal: anomaly detection, providing warnings (predicting events)
- Algebraic methods I: static networks
- Using the graph Laplacian
- Algebraic methods II: dynamic networks
- Time-dependent graphs
- Diffusion on variable graphs
- Predictive information
- What in the past is predictive?
- The Information Bottleneck method
Slide 3: Outline (ii)
- Algebraic methods I: static networks
- Connected components and flow bottlenecks
- Min-cut and max-flow
- Currents and potentials on graphs
- Minimal energy configuration
- Using Kirchhoff's laws
- Graph Laplacian
- Laplace equation on static graphs
- Diffusion on graphs
- Spectral decomposition of graphs
- Spectral embedding and clustering
- Spectral filtering
Slide 4: Outline (iii)
- Algebraic methods II: dynamic networks
- Links can be time-dependent
- Time-dependent graphs
- Diffusion on variable graphs
- Novelty detection on graphs
- Measuring distance between graphs
- Autoregressive models of graphs
Slide 5: Outline, Part II
- Statistical and information theoretic methods for dynamic network analysis
- Predictive information
- What in the past is predictive?
- Predictive information is sub-extensive!
- Finding predictive statistics
- Prediction suffix trees
- The Information Bottleneck method
Slide 6: Statement of the Problem
- Signals, observables, and correlations
- Examples:
- Family relations
- Social networks (education, joint activities, ...)
- Telephone calls
- Sensor arrays (spatial-temporal signals)
- Computer networks
- Neurons in the brain
- Biochemical networks
- General co-occurrence data (words-topics, ...)
Slide 8: Biological neural networks
Slide 9: Biochemical interactions
Slide 10: Gene expression data
(figure: gene expression analysis heatmap, genes vs. samples)
Slide 12: Example: Wireless Sensor Networks
(figure: a sensor node with a wireless transceiver and a sensing and processing unit)
Applications: battlefield surveillance, disaster relief, border control, environment monitoring, etc.
Slide 13: An Object Moving Through the Network
A moving object corresponds to a spatial peak moving with time; target tracking corresponds to determining the peak location over time.
Slide 14: A Simple Model for the Spatio-Temporal Signal Field
- s(x, y, t): a stationary complex Gaussian field in (x, y, t)
- Spatial bandwidths in the x and y dimensions, and a temporal signal bandwidth
- Spatial Coherence Region (SCR), set by the coherence distance
Slide 15: Spatial Sampling via Sensors
- Distance-bandwidth (DB) products
- K: total number of nodes
- Spatial degrees of freedom (DoF): the number of independent SCRs
- Number of nodes in each SCR (oversampling per DoF)
Slide 16: Graph Theoretic Formulation
- Signals, observables → activity (charge, field) of the nodes (V) of a graph
- Correlations, distances, co-occurrence frequencies → weights (current, flow) on the links (E) of a graph
(figure: two nodes i and j joined by a link with weight $W_{i,j}$)
A graph G(V, E) is a pair consisting of a set of nodes (vertices, V) together with a matrix W of non-negative weights on the links (edges, E). We first assume that the weight matrix is symmetric: $W_{i,j} = W_{j,i}$.
Slide 17: Undirected graph ↔ symmetric matrix
Slide 18: Security Issues
- Topology
- Identifying critical nodes
- Centrality: high degree
- Bottleneck: max-flow
- Strongly connected groups: cliques, min-cuts
- Social
- Find his friends: diffusion on the graph
- How are they linked? Collective network flow
- Temporal
- Connectivity changes
- Critical times (providing warnings)
- Anomaly prediction
Slide 19: Algebraic Methods: Static Networks
- Functions and optimization on graphs: a simple physical example
- Currents and potentials on a network of conductors
- Electric potential $\phi_i$ defined on each node
- Electric conductance $W_{i,j}$ of each edge
- Total electric power (energy dissipation rate) of the network:
  $E(\phi) = \tfrac{1}{2}\sum_{i,j} W_{i,j}\,(\phi_i - \phi_j)^2$
- We are interested in the minimal power configuration under potential constraints on some nodes (Kirchhoff's laws):
  $\min_\phi\; E(\phi) - \sum_k \mu_k \phi_k$, i.e. $\min_\phi\; \tfrac{1}{2}\sum_{i,j} W_{i,j}(\phi_i - \phi_j)^2 - \sum_k \mu_k \phi_k$
Slide 20: Algebraic Methods: Static Networks
- The quadratic energy function can be written as a quadratic form:
  $E(\phi) = \tfrac{1}{2}\sum_{i,j} W_{i,j}(\phi_i - \phi_j)^2 = \tfrac{1}{2}\sum_{i,j} W_{i,j}(\phi_i^2 - 2\phi_i\phi_j + \phi_j^2) = \sum_{i,j} L_{i,j}\,\phi_i\phi_j$
- where the graph Laplacian is defined as
  $L(G) = D - W$,
  with $D_{j,j} = \sum_i W_{i,j}$ a diagonal matrix.
- The minimum energy potential is the solution of the linear Laplace equation on a static graph (see the numerical sketch below):
  $L(G)\,\phi = \mu$
- The Lagrange multipliers $\mu$ are the currents through the nodes and must sum to zero. This corresponds to Kirchhoff's laws.
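A minimal numerical sketch of this setup (the conductance matrix and injected currents below are illustrative assumptions, not from the talk): build $L = D - W$ and solve $L\,\phi = \mu$ with a pseudoinverse, since L is singular.

```python
# Minimal sketch: potentials on a small conductor network (illustrative W, mu).
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 2],
              [0, 0, 2, 0]], dtype=float)  # symmetric conductances W[i, j]
D = np.diag(W.sum(axis=1))
L = D - W                                  # graph Laplacian L = D - W

mu = np.array([1.0, 0.0, 0.0, -1.0])       # injected node currents
assert abs(mu.sum()) < 1e-12               # currents must sum to zero (Kirchhoff)

# L is singular (the constant vector is in its kernel), so the potential is
# only defined up to an additive constant; the pseudoinverse picks one solution.
phi = np.linalg.pinv(L) @ mu
print(phi - phi.min())                     # potentials grounded at the minimum
```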
Slide 21: Algebraic Methods: Static Networks (cont.)
- Notice that L(G) is a singular matrix: the constant vector $\mathbf{1}$ is always an eigenvector with eigenvalue 0.
- In fact, the multiplicity of the zero eigenvalue, dim(ker(L)), is precisely the number of connected components of the graph!
- This is the basis for important algorithms known as spectral graph partitioning and spectral clustering (Ng, Jordan, Weiss 2001, and others):
- Represent the data points by their Laplacian eigenvectors with low eigenvalues (spectral embedding).
- Apply your favorite Euclidean-space clustering method to this representation (a sketch follows below).
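A minimal sketch of the embedding-plus-clustering recipe, assuming only NumPy and SciPy (scipy.cluster.vq supplies a basic k-means); taking the number of embedding eigenvectors equal to the number of clusters is one common convention.

```python
# Minimal sketch of spectral embedding and clustering on a weight matrix W.
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clusters(W, k):
    L = np.diag(W.sum(axis=1)) - W         # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    embedding = eigvecs[:, :k]             # k "smoothest" eigenvectors per node
    _, labels = kmeans2(embedding, k, minit='++')
    return labels
```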
Slide 22: Laplacian Eigenvector Decomposition
- Similar to a Fourier decomposition
- Orthogonal basis, projectors
- Eigenvalues play the role of spatial frequencies:
  $L(G)\,x_k = \lambda_k x_k$
(figure: the first eigenvectors $x_1$, $x_2$, $x_3$ plotted over the graph)
Slide 23: Application: Using Spectral Embedding for Novelty Detection in Communication Networks
- Represent the connection matrix using its Laplacian eigenvector decomposition.
- Filter out the high-frequency (small) components to remain with the large connected blocks.
- Use the filtered connections as a reference and measure the distance of the current graph to this reference.
Slide 24: Reordering the nodes based on spectral decomposition
Slide 25: Network Activity as a Graph
A network can be represented by nodes and arcs, where the nodes correspond to the sensors and the arcs to correlations between sensors.
Slide 26: Simple illustration
Slide 27: Distances Between Graphs
- Since graphs are represented by weight matrices, we can use any matrix norm to measure graph similarity.
- We can expand the Laplacian in its eigenvectors:
  $L = \sum_i \lambda_i\, p_i$,
  where $p_i = x_i x_i^\top$ is the projection on the i-th eigenvector.
- $P_K = \sum_{i=1}^{K} p_i$, the projection on the first K eigenvectors: a low-pass filter! (A sketch follows below.)
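A minimal sketch of the projector $P_K$ (the function name is mine):

```python
# Minimal sketch: spectral low-pass projector P_K for a graph Laplacian.
import numpy as np

def lowpass_projector(W, K):
    L = np.diag(W.sum(axis=1)) - W
    _, X = np.linalg.eigh(L)       # eigenvector columns, ascending eigenvalues
    Xk = X[:, :K]                  # the K lowest-frequency eigenvectors
    return Xk @ Xk.T               # P_K = sum of the K rank-1 projectors p_i
```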
Slide 28:
- We compare two networks by the volume of the overlap between their two K-projections, also known as the generalized angle between the two K-dimensional subspaces a and b:
  $\cos\theta(a, b) = \prod_{k=1}^{K} \cos\theta_k = \left|\det\!\left(X_a^\top X_b\right)\right|$,
  where the $\theta_k$ are the principal angles between the two subspaces.
- We use this angle to detect anomalies in the network activity (see the sketch below).
Slide 31: Example of the Anomaly Detection
Over 18 days of activity, our measure detected 3 days with significantly low volume overlap with the reference; these turned out to truly indicate anomalous activity for this network.
Slide 32: Very Simple Example
40 connectivity matrices were built from 2 different templates with noise: the first 20 from template 1, the second 20 from template 2.
Slide 33: The connectivity graphs
Slide 34: The embedded representation of the neurons as a function of time
Slide 35: Applying Our Method on the Example
(figure panels: reference matrix = template 1; reference matrix = template 2)
Slide 36: Applying our method on the example (finding the regularity)
Slide 37: Correlation coefficients between 8 GPe neurons while a monkey performed a task
Slide 38: Diffusion on Graphs
The next question: who is connected with whom? This can be modeled as the spread of a disease through a population, a forest fire, or ink on paper: diffusion. The graph Laplacian is also the generator of random walks on the graph, i.e., of diffusion. Denoting by $p(x, t)$ the density on node x at time t, the diffusion equation on the graph is a linear differential equation,
$\frac{\partial p}{\partial t} = -L\,p$,
which is solved by the matrix exponential
$p(t) = e^{-tL}\,p(0)$.
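A minimal sketch for a large sparse graph; the next slide points to Krylov-subspace tools like Expokit, and SciPy's expm_multiply plays a similar role (it applies $e^{-tL}$ to a vector without forming the full exponential). The random graph below is an illustrative stand-in.

```python
# Minimal sketch: diffusion p(t) = exp(-tL) p(0) on a large sparse graph.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

A = sp.random(200, 200, density=0.05, random_state=0)
W = A + A.T                                        # symmetric sparse weights
L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W

p0 = np.zeros(200)
p0[0] = 1.0                                        # all density on one source node
p1 = expm_multiply(-1.0 * L, p0)                   # density after time t = 1
```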
Slide 39: Computational Comment
- Matrix exponentials can be calculated very efficiently for very large sparse matrices using Krylov subspace methods. There is a free software tool (Expokit) for solving this type of equation in high dimensions.
- One may want to include decay (e.g., death of the virus), sources, or sinks. This is simple in this formulation: add a non-homogeneous term to the diffusion equation,
  $\frac{\partial p}{\partial t} = -L\,p + s(x, t)$,
  which can be solved with similar matrix exponentials.
Slide 40: Diffusion on Time-Dependent Graphs
- When the connection graphs change in time, we can still calculate diffusion very efficiently for large matrices.
- The general solution is (for discrete-time changes, with Laplacian $L_k$ on the k-th interval of length $\Delta t_k$):
  $p(t_n) = e^{-\Delta t_n L_n} \cdots e^{-\Delta t_2 L_2}\, e^{-\Delta t_1 L_1}\, p(0)$
- This allows us to identify connected nodes due to many factors in time-dependent and complex environments (see the sketch below).
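A minimal sketch of this product form, assuming a list of per-interval Laplacians Ls and durations dts (the names are mine):

```python
# Minimal sketch: diffusion through a sequence of graph snapshots.
from scipy.sparse.linalg import expm_multiply

def diffuse_over_time(Ls, dts, p0):
    p = p0
    for L, dt in zip(Ls, dts):             # piecewise-constant connectivity
        p = expm_multiply(-dt * L, p)      # p <- exp(-dt_k L_k) p
    return p
```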
Slide 41: Predictive Information
- What in the past is predictive?
- Predictive information is sub-extensive!
- Finding predictive statistics
- Prediction suffix trees
- The Information Bottleneck method
- Extracting relevant statistics from large data
Slide 42: Why Predictability?
- $X_1, X_2, \ldots, X_n, \ldots$
- Time-series extrapolation is different from prediction
- Prediction is probabilistic
Slide 43: Why Predictability? Life is all about predictions.
Slide 44: Predictive Information (with Bialek and Nemenman, 2001)
(figure: a trajectory W(t) around $t_0$, with a past T-window $W(-)$ and a future T-window $W(+)$)
- Estimate $P_T\big(W(-), W(+)\big)$: the T-past-future distribution
Slide 45: Predictive Information
- When looking at a time-dependent process W(t), what is the information in its past about its future?
  $I_{pred}(T) = I\big(W(-);\, W(+)\big)$
- This is a sub-extensive quantity: it grows sub-linearly with the time window. New: it can be estimated efficiently! (A plug-in sketch follows below.)
- It characterizes the process's complexity.
- It grows logarithmically when the process is generated by a finite-dimensional system, and as a power law in more complex cases (like natural language or music).
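A minimal plug-in sketch of the past-future mutual information for a discrete symbol sequence; this naive counting estimator is my illustration, not the efficient estimator the slide alludes to, and it needs generous amounts of data.

```python
# Minimal sketch: plug-in estimate of I(W(-); W(+)) with windows of length T.
from collections import Counter
import math

def predictive_information(seq, T):
    pairs = [(tuple(seq[i - T:i]), tuple(seq[i:i + T]))
             for i in range(T, len(seq) - T + 1)]
    n = len(pairs)
    joint = Counter(pairs)
    past = Counter(p for p, _ in pairs)
    future = Counter(f for _, f in pairs)
    # I = sum over (past, future) of p(p, f) * log[ p(p, f) / (p(p) p(f)) ]
    return sum((c / n) * math.log2(c * n / (past[p] * future[f]))
               for (p, f), c in joint.items())
```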
Slide 46: Logarithmic Growth for Finite-Dimensional Processes
- Finite-parameter processes (e.g., Markov chains): for a model with K parameters, $I_{pred}(T) \sim \frac{K}{2}\log T$.
- Similar to stochastic complexity (MDL).
Slide 47: Power-Law Growth
- Such fast growth is a signature of infinite-dimensional processes.
- Power laws emerge in cases where the interactions/correlations have long range.
49Entropy of 3 Generated Chains
Entropy is Extensive it shows No distinction
between the cases !
Slide 50: Predictive Information: the Subextensive Component of the Entropy
It shows a qualitative distinction between the cases! The growth of the subextensive component reflects the underlying complexity.
Slide 51: But WHAT, in the past, is predictive?
(figure: a trajectory W(t) around $t_0$, with a past T-window $W(-)$ and a future T-window $W(+)$)
- Using the Information Bottleneck method, solve
  $\min_{Z}\; I\big(W(-); Z\big) - \beta\, I\big(W(+); Z\big)$ for all $\beta > 0$
- This yields the T-past-future information curve $I_{TF}(I_{TP})$:
  $I_{Future}(I_{Past}) = \lim_{T \to \infty} I_{TF}(I_{TP})$
Slide 52:
(figure: nested information curves $I_{T_1F}(I_{T_1P})$, $I_{T_2F}(I_{T_2P})$, $I_{T_3F}(I_{T_3P})$ in the $(I_{Past}, I_{Future})$ plane)
- The limit is always the convex envelope of the information curves of increasing time windows.
- It can be calculated analytically for Markov chains, Gaussian processes, etc., and numerically in general, in a distribution-free way.
Slide 53: The Information Bottleneck Method
N. Tishby, F.C. Pereira, and W. Bialek, 1999.
Slide 54: The IB Algorithm: a Generalized Arimoto-Blahut Algorithm (as in rate-distortion theory)
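A minimal sketch of the self-consistent IB updates from the 1999 paper, alternating p(z), p(y|z), and p(z|x); the random initialization and the smoothing constants are my assumptions.

```python
# Minimal sketch: iterative IB updates for a finite joint distribution p(x, y).
import numpy as np

def information_bottleneck(p_xy, n_z, beta, n_iter=200, seed=None):
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                               # p(x)
    p_y_x = p_xy / p_x[:, None]                          # p(y|x)
    q_z_x = rng.dirichlet(np.ones(n_z), size=len(p_x))   # random init of p(z|x)
    for _ in range(n_iter):
        q_z = q_z_x.T @ p_x                              # p(z) = sum_x p(x) p(z|x)
        # p(y|z) = sum_x p(y|x) p(z|x) p(x) / p(z)
        q_y_z = (q_z_x * p_x[:, None]).T @ p_y_x / q_z[:, None]
        # D_KL(p(y|x) || p(y|z)) for every (x, z) pair, with small smoothing
        kl = np.array([[np.sum(p_y_x[x] * np.log((p_y_x[x] + 1e-12)
                                                 / (q_y_z[z] + 1e-12)))
                        for z in range(n_z)] for x in range(len(p_x))])
        # p(z|x) ~ p(z) exp(-beta * D_KL), renormalized over z
        q_z_x = q_z[None, :] * np.exp(-beta * kl)
        q_z_x /= q_z_x.sum(axis=1, keepdims=True)
    return q_z_x
```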
Slide 55: Why is the Predictive Information Interesting?
It determines the level of adaptation of organisms to stochastic environments.
Interesting key example: human languages.
Slide 56: Variable Memory Markov Models and Prediction Suffix Tree Learning (Ron, Singer, Tishby, 1995-96)
- Can be learned accurately and efficiently
- Can capture high-order correlations (better than HMMs)
- Effectively capture short features and motifs (a simplified predictor sketch follows below)
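A minimal sketch of the variable-memory idea: count next-symbol statistics for every context up to a maximum depth and predict from the longest context observed in training. The pruning and smoothing rules of the actual PST learning algorithm are omitted here.

```python
# Minimal sketch: a longest-suffix variable-memory Markov predictor.
from collections import Counter, defaultdict

def train(seq, max_depth):
    counts = defaultdict(Counter)          # context tuple -> next-symbol counts
    for i in range(len(seq)):
        for d in range(max_depth + 1):
            if i - d >= 0:
                counts[tuple(seq[i - d:i])][seq[i]] += 1
    return counts

def predict(counts, history, max_depth):
    # Fall back from the longest suffix of the history to shorter ones.
    for d in range(min(max_depth, len(history)), -1, -1):
        ctx = tuple(history[len(history) - d:])
        if ctx in counts:
            dist = counts[ctx]
            total = sum(dist.values())
            return {s: c / total for s, c in dist.items()}

# Usage: counts = train("abracadabra", 3); predict(counts, "abr", 3)
```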
Slide 57: Complexity-Accuracy Tradeoff
(figure: accuracy vs. complexity over the space of possible models/representations)
Slide 58:
- Simplified Chinese: 2.09
- Traditional Chinese: 1.73
- Dutch: 2.3
- French: 2.22
- Hebrew: 1.63
- Italian: 2.35
- Japanese: 1.42
- Portuguese: 2.9
- Spanish: 1.89
Slide 59: Can we understand it?
Slide 60: Many Thanks to
- Bill Bialek
- Ilya Nemenman
- Naama Parush
- Jonathan Rubin
- Eli Nelken
- Dmitry Davidov
- Felix Creutzig
- Amir Globerson
- Gal Chechik
- Roi Weiss