Title: Naftali Tishby
Slide 1: Algebraic and Information Theoretic Methods for Network Anomaly Detection
- Naftali Tishby
- School of Computer Science and Engineering
- Hebrew University of Jerusalem
- tishby@cs.huji.ac.il
- http://www.cs.huji.ac.il/tishby
NATO ASI School, Villa Cagnola, Gazzada, Italy
Slide 2: Outline
- Statement of the problem
- Signals, observables, correlations, and graphs
- Common security issues
- Topology: critical nodes, connected groups
- Social: find his friends; how are they linked?
- Temporal: anomaly detection, providing warnings (predicting events)
- Algebraic methods I: static networks
- Using the graph Laplacian
- Algebraic methods II: dynamic networks
- Time-dependent graphs
- Diffusion on variable graphs
- Predictive information
- What in the past is predictive?
- The Information Bottleneck method
Slide 3: Outline (ii)
- Algebraic methods I: static networks
- Connected components and flow bottlenecks
- Min-cut and max-flow
- Currents and potentials on graphs
- Minimal energy configuration
- Using Kirchhoff's laws
- Graph Laplacian
- Laplace equation on static graphs
- Diffusion on graphs
- Spectral decomposition of graphs
- Spectral embedding and clustering
- Spectral filtering
Slide 4: Outline (iii)
- Algebraic methods II: dynamic networks
- Links can be time-dependent
- Time-dependent graphs
- Diffusion on variable graphs
- Novelty detection on graphs
- Measuring distance between graphs
- Autoregressive models of graphs
Slide 5: Outline, Part II
- Statistical and information theoretic methods for dynamic network analysis
- Predictive information
- What in the past is predictive?
- Predictive information is sub-extensive!
- Finding predictive statistics
- Prediction suffix trees
- The Information Bottleneck method
Slide 6: Statement of the Problem
- Signals, observables, and correlations
- Examples:
- Family relations
- Social networks (education, joint activities, ...)
- Telephone calls
- Sensor arrays (spatial-temporal signals)
- Computer networks
- Neurons in the brain
- Biochemical networks
- General co-occurrence data (words-topics, ...)
Slide 8: Biological neural networks
Slide 9: Biochemical interactions
Slide 10: Gene expression data
(figure: gene expression analysis heatmap, genes vs. samples)
Slide 12: Example: Wireless Sensor Networks
(figure: a sensor node with a wireless transceiver and a sensing and processing unit)
Applications: battlefield surveillance, disaster relief, border control, environment monitoring, etc.
Slide 13: An Object Moving Through the Network
A moving object corresponds to a spatial peak moving with time; target tracking corresponds to determining the peak location over time.
Slide 14: A Simple Model for the Spatio-Temporal Signal Field
- s(x, y, t): a stationary complex Gaussian field in (x, y, t)
- Spatial bandwidths in the x and y dimensions, and a temporal signal bandwidth
- Spatial Coherence Region (SCR), set by the coherence distance
Slide 15: Spatial Sampling via Sensors
- Distance-bandwidth (DB) products
- K: total number of nodes
- Spatial degrees of freedom (DoF): the number of independent SCRs
- Number of nodes in each SCR (oversampling per DoF)
Slide 16: Graph Theoretic Formulation
- Signals, observables → activity (charge, field) of the nodes (V) of a graph
- Correlations, distances, co-occurrence frequencies → weights (current, flow) on the links (E) of a graph
(figure: two nodes i and j joined by a link with weight $W_{i,j}$)
A graph G(V, E) is a pair consisting of a set of nodes (vertices, V) together with a matrix W of non-negative weights on the links (edges, E). We first assume that the weight matrix is symmetric: $W_{i,j} = W_{j,i}$.
Slide 17: Undirected graph ↔ symmetric matrix
Slide 18: Security Issues
- Topology
- Identifying critical nodes
- Centrality: high degree
- Bottleneck: max-flow
- Strongly connected groups: cliques, min-cuts
- Social
- Find his friends: diffusion on the graph
- How are they linked? Collective network flow
- Temporal
- Connectivity changes
- Critical times (providing warnings)
- Anomaly prediction
Slide 19: Algebraic Methods: Static Networks
- Functions and optimization on graphs: a simple physical example
- Currents and potentials on a network of conductors
- Electric potential $\phi_i$ defined on each node
- Electric conductance $W_{i,j}$ of each edge
- Total electric power (energy dissipation rate) of the network:
  $E(\phi) = \tfrac{1}{2}\sum_{i,j} W_{i,j}\,(\phi_i - \phi_j)^2$
- We are interested in the minimal power configuration under potential constraints on some nodes (Kirchhoff's laws):
  $\min_\phi\; E(\phi) - \sum_k \mu_k \phi_k$, i.e. $\min_\phi\; \tfrac{1}{2}\sum_{i,j} W_{i,j}(\phi_i - \phi_j)^2 - \sum_k \mu_k \phi_k$
Slide 20: Algebraic Methods: Static Networks
- The quadratic energy function can be written as a quadratic form:
  $E(\phi) = \tfrac{1}{2}\sum_{i,j} W_{i,j}(\phi_i - \phi_j)^2 = \tfrac{1}{2}\sum_{i,j} W_{i,j}(\phi_i^2 - 2\phi_i\phi_j + \phi_j^2) = \sum_{i,j} L_{i,j}\,\phi_i\phi_j$
- where the graph Laplacian is defined as
  $L(G) = D - W$,
  with $D_{j,j} = \sum_i W_{i,j}$ a diagonal matrix.
- The minimum energy potential is the solution of the linear Laplace equation on a static graph (see the numerical sketch below):
  $L(G)\,\phi = \mu$
- The Lagrange multipliers $\mu$ are the currents through the nodes and must sum to zero. This corresponds to Kirchhoff's laws.
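A minimal numerical sketch of this setup (the conductance matrix and injected currents below are illustrative assumptions, not from the talk): build $L = D - W$ and solve $L\,\phi = \mu$ with a pseudoinverse, since L is singular.

```python
# Minimal sketch: potentials on a small conductor network (illustrative W, mu).
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 2],
              [0, 0, 2, 0]], dtype=float)  # symmetric conductances W[i, j]
D = np.diag(W.sum(axis=1))
L = D - W                                  # graph Laplacian L = D - W

mu = np.array([1.0, 0.0, 0.0, -1.0])       # injected node currents
assert abs(mu.sum()) < 1e-12               # currents must sum to zero (Kirchhoff)

# L is singular (the constant vector is in its kernel), so the potential is
# only defined up to an additive constant; the pseudoinverse picks one solution.
phi = np.linalg.pinv(L) @ mu
print(phi - phi.min())                     # potentials grounded at the minimum
```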
Slide 21: Algebraic Methods: Static Networks (cont.)
- Notice that L(G) is a singular matrix: the constant vector $\mathbf{1}$ is always an eigenvector with eigenvalue 0.
- In fact, the multiplicity of the zero eigenvalue, dim(ker(L)), is precisely the number of connected components of the graph!
- This is the basis for important algorithms known as spectral graph partitioning and spectral clustering (Ng, Jordan, Weiss 2001, and others):
- Represent the data points by their Laplacian eigenvectors with low eigenvalues (spectral embedding).
- Apply your favorite Euclidean-space clustering method to this representation (a sketch follows below).
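A minimal sketch of the embedding-plus-clustering recipe, assuming only NumPy and SciPy (scipy.cluster.vq supplies a basic k-means); taking the number of embedding eigenvectors equal to the number of clusters is one common convention.

```python
# Minimal sketch of spectral embedding and clustering on a weight matrix W.
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clusters(W, k):
    L = np.diag(W.sum(axis=1)) - W         # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    embedding = eigvecs[:, :k]             # k "smoothest" eigenvectors per node
    _, labels = kmeans2(embedding, k, minit='++')
    return labels
```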
Slide 22: Laplacian Eigenvector Decomposition
- Similar to a Fourier decomposition
- Orthogonal basis, projectors
- Eigenvalues play the role of spatial frequencies:
  $L(G)\,x_k = \lambda_k x_k$
(figure: the first eigenvectors $x_1$, $x_2$, $x_3$ plotted over the graph)
Slide 23: Application: Using Spectral Embedding for Novelty Detection in Communication Networks
- Represent the connection matrix using its Laplacian eigenvector decomposition.
- Filter out the high-frequency (small) components to remain with the large connected blocks.
- Use the filtered connections as a reference and measure the distance of the current graph to this reference.
Slide 24: Reordering the nodes based on spectral decomposition
Slide 25: Network Activity as a Graph
A network can be represented by nodes and arcs, where the nodes correspond to the sensors and the arcs to correlations between sensors.
Slide 26: Simple illustration
Slide 27: Distances Between Graphs
- Since graphs are represented by weight matrices, we can use any matrix norm to measure graph similarity.
- We can expand the Laplacian in its eigenvectors:
  $L = \sum_i \lambda_i\, p_i$,
  where $p_i = x_i x_i^\top$ is the projection on the i-th eigenvector.
- $P_K = \sum_{i=1}^{K} p_i$, the projection on the first K eigenvectors: a low-pass filter! (A sketch follows below.)
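A minimal sketch of the projector $P_K$ (the function name is mine):

```python
# Minimal sketch: spectral low-pass projector P_K for a graph Laplacian.
import numpy as np

def lowpass_projector(W, K):
    L = np.diag(W.sum(axis=1)) - W
    _, X = np.linalg.eigh(L)       # eigenvector columns, ascending eigenvalues
    Xk = X[:, :K]                  # the K lowest-frequency eigenvectors
    return Xk @ Xk.T               # P_K = sum of the K rank-1 projectors p_i
```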
Slide 28:
- We compare two networks by the volume of the overlap between their two K-projections, also known as the generalized angle between the two K-dimensional subspaces a and b:
  $\cos\theta(a, b) = \prod_{k=1}^{K} \cos\theta_k = \left|\det\!\left(X_a^\top X_b\right)\right|$,
  where the $\theta_k$ are the principal angles between the two subspaces.
- We use this angle to detect anomalies in the network activity (see the sketch below).
Slide 31: Example of the Anomaly Detection
Over 18 days of activity, our measure detected 3 days with significantly low volume overlap with the reference; these turned out to truly indicate anomalous activity for this network.
Slide 32: Very Simple Example
40 connectivity matrices were built from 2 different templates with noise: the first 20 from template 1, the second 20 from template 2.
Slide 33: The connectivity graphs
Slide 34: The embedded representation of the neurons as a function of time
Slide 35: Applying Our Method on the Example
(figure panels: reference matrix = template 1; reference matrix = template 2)
Slide 36: Applying our method on the example (finding the regularity)
Slide 37: Correlation coefficients between 8 GPe neurons while a monkey performed a task
Slide 38: Diffusion on Graphs
The next question: who is connected with whom? This can be modeled as the spread of a disease through a population, a forest fire, or ink on paper: diffusion. The graph Laplacian is also the generator of random walks on the graph, i.e., of diffusion. Denoting by $p(x, t)$ the density on node x at time t, the diffusion equation on the graph is a linear differential equation,
$\frac{\partial p}{\partial t} = -L\,p$,
which is solved by the matrix exponential
$p(t) = e^{-tL}\,p(0)$.
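A minimal sketch for a large sparse graph; the next slide points to Krylov-subspace tools like Expokit, and SciPy's expm_multiply plays a similar role (it applies $e^{-tL}$ to a vector without forming the full exponential). The random graph below is an illustrative stand-in.

```python
# Minimal sketch: diffusion p(t) = exp(-tL) p(0) on a large sparse graph.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

A = sp.random(200, 200, density=0.05, random_state=0)
W = A + A.T                                        # symmetric sparse weights
L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W

p0 = np.zeros(200)
p0[0] = 1.0                                        # all density on one source node
p1 = expm_multiply(-1.0 * L, p0)                   # density after time t = 1
```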
Slide 39: Computational Comment
- Matrix exponentials can be calculated very efficiently for very large sparse matrices using Krylov subspace methods. There is a free software tool (Expokit) for solving this type of equation in high dimensions.
- One may want to include decay (e.g., death of the virus), sources, or sinks. This is simple in this formulation: add a non-homogeneous term to the diffusion equation,
  $\frac{\partial p}{\partial t} = -L\,p + s(x, t)$,
  which can be solved with similar matrix exponentials.
Slide 40: Diffusion on Time-Dependent Graphs
- When the connection graphs change in time, we can still calculate diffusion very efficiently for large matrices.
- The general solution is (for discrete-time changes, with Laplacian $L_k$ on the k-th interval of length $\Delta t_k$):
  $p(t_n) = e^{-\Delta t_n L_n} \cdots e^{-\Delta t_2 L_2}\, e^{-\Delta t_1 L_1}\, p(0)$
- This allows us to identify connected nodes due to many factors in time-dependent and complex environments (see the sketch below).
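A minimal sketch of this product form, assuming a list of per-interval Laplacians Ls and durations dts (the names are mine):

```python
# Minimal sketch: diffusion through a sequence of graph snapshots.
from scipy.sparse.linalg import expm_multiply

def diffuse_over_time(Ls, dts, p0):
    p = p0
    for L, dt in zip(Ls, dts):             # piecewise-constant connectivity
        p = expm_multiply(-dt * L, p)      # p <- exp(-dt_k L_k) p
    return p
```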
Slide 41: Predictive Information
- What in the past is predictive?
- Predictive information is sub-extensive!
- Finding predictive statistics
- Prediction suffix trees
- The Information Bottleneck method
- Extracting relevant statistics from large data
Slide 42: Why Predictability?
- $X_1, X_2, \ldots, X_n, \ldots$
- Time-series extrapolation is different from prediction
- Prediction is probabilistic
Slide 43: Why Predictability? Life is all about predictions.
Slide 44: Predictive Information (with Bialek and Nemenman, 2001)
(figure: a trajectory W(t) around $t_0$, with a past T-window $W(-)$ and a future T-window $W(+)$)
- Estimate $P_T\big(W(-), W(+)\big)$: the T-past-future distribution
Slide 45: Predictive Information
- When looking at a time-dependent process W(t), what is the information in its past about its future?
  $I_{pred}(T) = I\big(W(-);\, W(+)\big)$
- This is a sub-extensive quantity: it grows sub-linearly with the time window. New: it can be estimated efficiently! (A plug-in sketch follows below.)
- It characterizes the process's complexity.
- It grows logarithmically when the process is generated by a finite-dimensional system, and as a power law in more complex cases (like natural language or music).
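A minimal plug-in sketch of the past-future mutual information for a discrete symbol sequence; this naive counting estimator is my illustration, not the efficient estimator the slide alludes to, and it needs generous amounts of data.

```python
# Minimal sketch: plug-in estimate of I(W(-); W(+)) with windows of length T.
from collections import Counter
import math

def predictive_information(seq, T):
    pairs = [(tuple(seq[i - T:i]), tuple(seq[i:i + T]))
             for i in range(T, len(seq) - T + 1)]
    n = len(pairs)
    joint = Counter(pairs)
    past = Counter(p for p, _ in pairs)
    future = Counter(f for _, f in pairs)
    # I = sum over (past, future) of p(p, f) * log[ p(p, f) / (p(p) p(f)) ]
    return sum((c / n) * math.log2(c * n / (past[p] * future[f]))
               for (p, f), c in joint.items())
```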
Slide 46: Logarithmic Growth for Finite-Dimensional Processes
- Finite-parameter processes (e.g., Markov chains): for a model with K parameters, $I_{pred}(T) \sim \frac{K}{2}\log T$.
- Similar to stochastic complexity (MDL).
Slide 47: Power-Law Growth
- Such fast growth is a signature of infinite-dimensional processes.
- Power laws emerge in cases where the interactions/correlations have long range.
49Entropy of 3 Generated Chains
Entropy is Extensive it shows No distinction
between the cases !
Slide 50: Predictive Information: the Subextensive Component of the Entropy
It shows a qualitative distinction between the cases! The growth of the subextensive component reflects the underlying complexity.
Slide 51: But WHAT, in the past, is predictive?
(figure: a trajectory W(t) around $t_0$, with a past T-window $W(-)$ and a future T-window $W(+)$)
- Using the Information Bottleneck method, solve
  $\min_{Z}\; I\big(W(-); Z\big) - \beta\, I\big(W(+); Z\big)$ for all $\beta > 0$
- This yields the T-past-future information curve $I_{TF}(I_{TP})$:
  $I_{Future}(I_{Past}) = \lim_{T \to \infty} I_{TF}(I_{TP})$
Slide 52:
(figure: nested information curves $I_{T_1F}(I_{T_1P})$, $I_{T_2F}(I_{T_2P})$, $I_{T_3F}(I_{T_3P})$ in the $(I_{Past}, I_{Future})$ plane)
- The limit is always the convex envelope of the information curves of increasing time windows.
- It can be calculated analytically for Markov chains, Gaussian processes, etc., and numerically in general, in a distribution-free way.
Slide 53: The Information Bottleneck Method
N. Tishby, F.C. Pereira, and W. Bialek, 1999.
Slide 54: The IB Algorithm: a Generalized Arimoto-Blahut Algorithm (as in rate-distortion theory)
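A minimal sketch of the self-consistent IB updates from the 1999 paper, alternating p(z), p(y|z), and p(z|x); the random initialization and the smoothing constants are my assumptions.

```python
# Minimal sketch: iterative IB updates for a finite joint distribution p(x, y).
import numpy as np

def information_bottleneck(p_xy, n_z, beta, n_iter=200, seed=None):
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                               # p(x)
    p_y_x = p_xy / p_x[:, None]                          # p(y|x)
    q_z_x = rng.dirichlet(np.ones(n_z), size=len(p_x))   # random init of p(z|x)
    for _ in range(n_iter):
        q_z = q_z_x.T @ p_x                              # p(z) = sum_x p(x) p(z|x)
        # p(y|z) = sum_x p(y|x) p(z|x) p(x) / p(z)
        q_y_z = (q_z_x * p_x[:, None]).T @ p_y_x / q_z[:, None]
        # D_KL(p(y|x) || p(y|z)) for every (x, z) pair, with small smoothing
        kl = np.array([[np.sum(p_y_x[x] * np.log((p_y_x[x] + 1e-12)
                                                 / (q_y_z[z] + 1e-12)))
                        for z in range(n_z)] for x in range(len(p_x))])
        # p(z|x) ~ p(z) exp(-beta * D_KL), renormalized over z
        q_z_x = q_z[None, :] * np.exp(-beta * kl)
        q_z_x /= q_z_x.sum(axis=1, keepdims=True)
    return q_z_x
```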
Slide 55: Why is the Predictive Information Interesting?
It determines the level of adaptation of organisms to stochastic environments.
Interesting key example: human languages.
Slide 56: Variable Memory Markov Models and Prediction Suffix Tree Learning (Ron, Singer, Tishby, 1995-96)
- Can be learned accurately and efficiently
- Can capture high-order correlations (better than HMMs)
- Effectively capture short features and motifs (a simplified predictor sketch follows below)
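A minimal sketch of the variable-memory idea: count next-symbol statistics for every context up to a maximum depth and predict from the longest context observed in training. The pruning and smoothing rules of the actual PST learning algorithm are omitted here.

```python
# Minimal sketch: a longest-suffix variable-memory Markov predictor.
from collections import Counter, defaultdict

def train(seq, max_depth):
    counts = defaultdict(Counter)          # context tuple -> next-symbol counts
    for i in range(len(seq)):
        for d in range(max_depth + 1):
            if i - d >= 0:
                counts[tuple(seq[i - d:i])][seq[i]] += 1
    return counts

def predict(counts, history, max_depth):
    # Fall back from the longest suffix of the history to shorter ones.
    for d in range(min(max_depth, len(history)), -1, -1):
        ctx = tuple(history[len(history) - d:])
        if ctx in counts:
            dist = counts[ctx]
            total = sum(dist.values())
            return {s: c / total for s, c in dist.items()}

# Usage: counts = train("abracadabra", 3); predict(counts, "abr", 3)
```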
Slide 57: Complexity-Accuracy Tradeoff
(figure: accuracy vs. complexity over the space of possible models/representations)
Slide 58:
- Simplified Chinese: 2.09
- Traditional Chinese: 1.73
- Dutch: 2.3
- French: 2.22
- Hebrew: 1.63
- Italian: 2.35
- Japanese: 1.42
- Portuguese: 2.9
- Spanish: 1.89
Slide 59: Can we understand it?
Slide 60: Many Thanks to
- Bill Bialek
- Ilya Nemenman
- Naama Parush
- Jonathan Rubin
- Eli Nelken
- Dmitry Davidov
- Felix Creutzig
- Amir Globerson
- Gal Chechik
- Roi Weiss