1
Algebraic and Information Theoretic Methods for
Network Anomaly Detection
  • Naftali Tishby
  • School of Computer Science and Engineering
  • Hebrew University of Jerusalem
  • tishby@cs.huji.ac.il
  • http://www.cs.huji.ac.il/tishby

NATO ASI School, Villa Cagnola, Gazzada, Italy
2
Outline
  • Statement of the problem
  • Signals, observables, correlations and graphs
  • Common Security Issues
  • Topology
  • critical nodes, connected groups
  • Social
  • find one's friends; how are they linked?
  • Temporal
  • Anomaly detection
  • providing warnings (predicting events)
  • Algebraic methods I - static networks
  • using the graph Laplacian
  • Algebraic methods II - dynamic networks
  • Time dependent graphs
  • Diffusion on variable graphs
  • Predictive Information
  • What in the past is predictive?
  • The Information Bottleneck method

3
Outline (ii)
  • Algebraic methods I - static networks
  • Connected components and flow bottlenecks
  • Min-Cut and Max-flow
  • Currents and potentials on graphs
  • Minimal energy configuration
  • Using Kirchhoff's laws
  • Graph Laplacian
  • Laplace equation on static graphs
  • Diffusion on graphs
  • Spectral decomposition of graphs
  • Spectral embedding and clustering
  • Spectral filtering

4
Outline (iii)
  • Algebraic methods II - dynamic networks
  • Links can be time-dependent
  • Time dependent graphs
  • Diffusion on variable graphs
  • Novelty detection on graphs
  • Measuring distance between graphs
  • Autoregressive models of graphs

5
Outline Part II
  • Statistical and Information Theoretic methods for
    Dynamic Network Analysis
  • Predictive Information
  • What in the past is predictive?
  • Predictive information is sub-extensive!
  • Finding predictive statistics
  • Prediction suffix trees
  • The Information Bottleneck method

6
Statement of the problem
  • Signals, observables and correlations
  • Examples
  • Family relations
  • Social networks (education, joint activities, ...)
  • Telephone calls
  • Sensor arrays (spatial-temporal signals)
  • Computer networks
  • Neurons in the brain
  • Biochemical networks
  • General co-occurrence data (words-topics, ...)
  • ...

7
(No Transcript)
8
Biological neural networks
9
Biochemical interactions
10
Gene expression data
  • What should one measure?

Gene expression analysis
[Figure: expression matrix of genes × samples]
11
(No Transcript)
12
Example
Wireless Sensor Networks
[Figure: a sensor node with a wireless transceiver
and a sensing and processing unit]
Battlefield surveillance, disaster relief, border
control, environment monitoring, etc.
13
An Object Moving Through the Network
A moving object corresponds to a spatial peak
moving with time. Target tracking corresponds to
determining the peak location over time.
14
A Simple Model for the Spatio-Temporal Signal
Field
s(x,y,t): a stationary complex Gaussian field in
(x,y,t), with given spatial bandwidths in the x
and y dimensions and a given temporal signal
bandwidth. The Spatial Coherence Region (SCR) is
set by the coherence distance.
15
Spatial Sampling via Sensors
Distance-bandwidth (DB) products; K = total number of nodes.
Spatial DoF = number of independent Spatial Coherence Regions (SCRs).
Number of nodes in each SCR = oversampling per DoF.
16
Graph Theoretic Formulation
  • Signals, observables
  • → activity (charge, field) of nodes (V) in a
    graph
  • Correlations, distances, co-occurrence frequency, ...
  • → weights (current, flow) on links (E) in a
    graph

[Diagram: nodes i and j joined by an edge with weight Wi,j]

A graph G(V,E) is a pair consisting of a set of
nodes (or vertices, V) and a matrix, W, of
non-negative weights on the links (edges, E).
We first assume that the weight matrix is
symmetric: Wi,j = Wj,i
17
Undirected graph ↔ symmetric matrix
18
Security Issues
  • Topology
  • Identifying critical nodes
  • Centrality - high degree
  • Bottleneck - max-flow
  • Strongly connected groups - cliques, min-cuts
  • Social
  • find one's friends - diffusion on the graph
  • how are they linked? - collective network flow
  • Temporal
  • Connectivity changes
  • Critical times (providing warnings)?
  • Anomaly prediction

19
Algebraic Methods - Static Networks
  • Functions and optimization on graphs
  • A simple physical example
  • Currents and potentials on a network of
    conductors
  • electric potential φi defined on each node
  • electric conductance Wi,j of each edge
  • Total electric power (energy dissipation rate)
    of the network:
  • E(φ) = ½ Σi,j Wi,j (φi − φj)²
  • We are interested in the minimal power
    configuration under potential constraints on
    some nodes (Kirchhoff's laws):
  • Minφ E(φ) − Σk mk φk , or Minφ ½ Σi,j Wi,j (φi − φj)² − Σk mk φk

20
Algebraic Methods - Static Networks
  • The quadratic energy function can be written as
    a quadratic form:
  • E(φ) = ½ Σi,j Wi,j (φi − φj)² = ½ Σi,j Wi,j (φi² − 2φiφj + φj²) = Σi,j Li,j φi φj
  • where the Graph Laplacian is defined as
  • L(G) = D − W
  • with Dj,j = Σi Wi,j , a diagonal matrix.
  • The minimum energy potential is the solution of
    the linear Laplace equation on a static graph:
  • L(G)φ = m
  • The Lagrange multipliers m are the currents
    through the nodes and must sum to zero. This
    corresponds to Kirchhoff's laws (see the sketch
    below).
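To make this concrete, here is a minimal NumPy sketch (my illustration, not code from the talk): it builds L(G) = D − W for a 4-node chain of unit conductors and solves L(G)φ = m for the minimum-energy potentials, using the pseudo-inverse to handle the singular Laplacian.

```python
# Minimal sketch (illustration only): build the graph Laplacian and solve
# the Laplace equation L(G) phi = m for a 4-node chain of unit conductors.
import numpy as np

def graph_laplacian(W):
    """L(G) = D - W, with D_jj = sum_i W_ij (weighted degrees)."""
    return np.diag(W.sum(axis=0)) - W

W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = graph_laplacian(W)

# Currents m injected at the nodes must sum to zero (Kirchhoff's laws):
# one unit in at node 0, one unit out at node 3.
m = np.array([1., 0., 0., -1.])

# L is singular (constant vectors span its kernel), so use the
# pseudo-inverse; phi is the minimum-energy potential, unique up to
# an additive constant.
phi = np.linalg.pinv(L) @ m
print(phi)            # node potentials
print(phi @ L @ phi)  # dissipated power E(phi) = 3 for this chain
```

For large sparse graphs one would instead use scipy.sparse with an iterative solver and a gauge constraint rather than a dense pseudo-inverse.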

21
Algebraic Methods - Static Networks (cont.)
  • Notice that L(G) is a singular matrix: the
    constant vector 1 is always an eigenvector with
    eigenvalue 0.
  • In fact, the multiplicity of the zero eigenvalue,
    dim(ker(L)), is precisely the number of connected
    components of the graph!
  • This is the basis for important algorithms known
    as
  • spectral graph partitioning and spectral
    clustering
  • (Ng, Jordan, Weiss 2000, and others)
  • Represent data-points by their Laplacian
    eigenvectors with low eigenvalues (spectral
    embedding)
  • Apply your favorite Euclidean-space clustering
    method to this representation (see the sketch
    below).
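Both facts can be seen in a short sketch (my illustration; the helper name and the toy graph are invented): the multiplicity of the numerically-zero eigenvalues counts the connected components, and k-means on the low eigenvectors recovers them.

```python
# Sketch: count connected components via dim(ker L) and cluster nodes in
# the spectral embedding. Illustration only, not the authors' code.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_embedding(W, k):
    L = np.diag(W.sum(axis=0)) - W
    vals, vecs = eigh(L)                       # eigenvalues in ascending order
    n_components = int(np.sum(vals < 1e-10))   # multiplicity of eigenvalue 0
    return n_components, vecs[:, :k]           # embed by k lowest eigenvectors

# Two disjoint edges: {0,1} and {2,3}.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
n_comp, X = spectral_embedding(W, k=2)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(n_comp)   # 2 connected components
print(labels)   # nodes {0,1} and {2,3} fall into different clusters
```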

22
Laplacian eigenvector decomposition
  • Similar to Fourier decomposition
  • Orthogonal basis, projectors
  • Eigenvalues ↔ spatial frequencies

L(G) xk = λk xk

[Figure: the first eigenvectors x1, x2, x3 plotted over the nodes]
23
Application Using Spectral Embedding for
Novelty Detection in communication networks
  • Represent the connection matrix using its
    Laplacian eigenvector decomposition
  • Filter the high frequency (small) components to
    remain with the large connected blocks
  • Use the filtered connections as a reference,
    and measure the distance of the current graph
    to the reference, as sketched below.
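A hedged sketch of these three steps, assuming a plain Frobenius norm as the distance (the helper names and random test graphs are mine):

```python
# Sketch of spectral low-pass filtering for novelty detection: keep only
# the K lowest-frequency Laplacian components, then compare graphs by a
# matrix norm. Illustration only.
import numpy as np
from scipy.linalg import eigh

def lowpass_laplacian(W, K):
    """Project L(G) onto its K lowest eigenvectors (the large blocks)."""
    L = np.diag(W.sum(axis=0)) - W
    vals, vecs = eigh(L)
    Uk = vecs[:, :K]                  # K smallest 'spatial frequencies'
    return Uk @ np.diag(vals[:K]) @ Uk.T

rng = np.random.default_rng(0)

def random_graph(n=20):
    A = rng.random((n, n)) * (rng.random((n, n)) > 0.7)
    W = np.triu(A, 1)
    return W + W.T                    # symmetric weight matrix

W_ref, W_cur = random_graph(), random_graph()
d = np.linalg.norm(lowpass_laplacian(W_ref, 5) - lowpass_laplacian(W_cur, 5))
print(d)   # a large distance to the reference flags anomalous activity
```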

24
Reordering the nodes based on Spectral
decomposition
25
Network activity as a Graph
A network can be represented by nodes and arcs,
where the nodes correspond to the sensors and the
arcs to correlations between sensors
26
Simple illustration
27
Distances between graphs
  • Since graphs are represented by weight matrices
    we can use any matrix norm to measure graph
    similarity
  • We can expand the Laplacian in its
    eigenvectors, L = Σi λi Pi
  • where Pi is the projection on the i-th
    eigenvector
  • A projection on the first K eigenvectors is a
    low-pass filter!

28
  • We compare two networks by the volume of the
    overlap between their two K-projections, also
    known as the generalized angle between the two
    K-dimensional subspaces, a and b
  • We use this angle to detect anomalies in the
    network activity (see the sketch below).
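One concrete reading of the overlap volume (an assumption on my part) is the product of the cosines of the principal angles between the two subspaces, |det(Ua^T Ub)|; SciPy computes the angles directly:

```python
# Sketch: overlap volume between the K-dimensional Laplacian eigen-subspaces
# of two graphs, via principal angles. Illustration only.
import numpy as np
from scipy.linalg import eigh, subspace_angles

def k_subspace(W, K):
    L = np.diag(W.sum(axis=0)) - W
    _, vecs = eigh(L)
    return vecs[:, :K]              # orthonormal basis of the K-projection

def overlap_volume(Wa, Wb, K):
    angles = subspace_angles(k_subspace(Wa, K), k_subspace(Wb, K))
    return float(np.prod(np.cos(angles)))    # |det(Ua^T Ub)|

rng = np.random.default_rng(1)
A = rng.random((15, 15))
W_ref = np.triu(A, 1) + np.triu(A, 1).T      # reference graph
W_cur = W_ref + 0.05 * rng.standard_normal((15, 15))
W_cur = np.abs(np.triu(W_cur, 1) + np.triu(W_cur, 1).T)  # perturbed graph
print(overlap_volume(W_ref, W_cur, K=4))  # near 1 = normal, near 0 = anomaly
```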

29
(No Transcript)
30
(No Transcript)
31
Example of the anomaly detection
Over 18 days of activity, our measure detected 3
days with significantly low volume overlap with
the reference, which turned out to truly indicate
anomalous activity for this network.
32
Very simple example
40 connectivity matrices were built from 2
different templates with noise: the first 20 from
Template 1, the second 20 from Template 2.
33
The connectivity graphs
34
The embedded representation of the neurons as a
function of time
35
Applying our method on the example
Reference matrix: Template 1
Reference matrix: Template 2
36
Applying our method on the example (finding the
regularity)
37
Correlation coefficients between 8 GPe neurons
while a monkey performed a task.
38
Diffusion on Graphs
The next question: who is connected with whom?
This can be modeled as the spread of a disease
through a population, a forest fire, or ink on
paper, i.e. through diffusion. The graph
Laplacian is also the generator of random walks
on the graph, or diffusion. Denoting by ρ(x,t)
the density on node x at time t, the diffusion
equation on the graph, dρ/dt = −L(G)ρ, is a
linear differential equation, which is solved by
the matrix exponential ρ(t) = exp(−L(G)t) ρ(0).
39
  • Computational comment
  • Matrix exponentials can be calculated very
    efficiently for very large sparse matrices
    using Krylov subspace methods. There is a free
    software tool (Expokit) for solving this type
    of equation in high dimension (see the sketch
    below).
  • One may want to include decay (e.g. death of
    the virus), sources or sinks. This is simple in
    this formulation: add a non-homogeneous term to
    the diffusion equation, which can be solved
    with similar matrix exponentials.
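Expokit itself is a Fortran package; an analogous sketch in SciPy (my choice of tool, not the talk's) applies exp(−Lt) to a density on a large sparse graph without ever forming the dense exponential:

```python
# Sketch: diffusion on a large sparse graph via the action of the matrix
# exponential (scipy's expm_multiply plays the role of Expokit here).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

n = 1000
W = sp.random(n, n, density=0.01, random_state=0)
W = W + W.T                                  # symmetric sparse weights
L = sp.diags(np.asarray(W.sum(axis=0)).ravel()) - W

rho0 = np.zeros(n)
rho0[0] = 1.0                                # all density starts on node 0
rho_t = expm_multiply(-0.5 * L, rho0)        # rho(t) = exp(-L t) rho(0), t=0.5
print(rho_t.sum())                           # diffusion conserves total density
```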

40
Diffusion on time dependent graphs
  • When the connection graphs change in time we can
    still calculate diffusion very efficiently for
    large matrices
  • The general solution is (for discrete time
    changes) a time-ordered product of matrix
    exponentials: ρ(t) = exp(−Ln Δtn) ··· exp(−L1 Δt1) ρ(0)
  • This allows us to identify connected nodes, due
    to many factors, in time-dependent and complex
    environments (a sketch follows below).
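A minimal dense sketch of that product formula (the function names are mine):

```python
# Sketch: diffusion through a sequence of graphs; the solution is the
# time-ordered product of matrix exponentials applied to rho(0).
import numpy as np
from scipy.linalg import expm

def laplacian(W):
    return np.diag(W.sum(axis=0)) - W

def diffuse_through(Ws, dts, rho0):
    """rho(T) = exp(-L_n dt_n) ... exp(-L_1 dt_1) rho(0)."""
    rho = rho0
    for W, dt in zip(Ws, dts):
        rho = expm(-laplacian(W) * dt) @ rho
    return rho

W_on = np.array([[0., 1.], [1., 0.]])    # the two nodes are linked ...
W_off = np.zeros((2, 2))                 # ... then the link disappears
rho = diffuse_through([W_on, W_off], [1.0, 1.0], np.array([1.0, 0.0]))
print(rho)   # mixing happened only while the link existed
```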

41
  • Predictive Information
  • What in the past is predictive?
  • Predictive information is sub-extensive!
  • Finding predictive statistics
  • Prediction suffix trees
  • The Information Bottleneck method
  • Extracting relevant statistics from large data

42
Why Predictability?
  • X1, X2, ..., Xn, ...
  • Time-series extrapolation is different from
    prediction
  • Prediction is Probabilistic

43
Why Predictability? Life is all about
predictions
44
Predictive Information (with Bialek and Nemenman,
2001)
[Diagram: a process W(t) around t = 0, with a
past T-window W(−) and a future T-window W(+)]
  • Estimate PT(W(−), W(+)), the T past-future
    joint distribution

45
Predictive Information
  • When looking at a time-dependent process, W(t),
    what is the information in its past about its
    future?
  • This is a sub-extensive quantity: it grows
    sub-linearly with the time window. New: it can
    be estimated efficiently!
  • Characterizes the process's complexity.
  • Grows logarithmically when generated by
    finite-dimensional systems, and as a power law
    for more complex cases (like natural language
    or music). A rough estimator is sketched below.
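As a toy illustration of such an estimate (this plug-in estimator is my construction, not the authors' method), one can count joint past/future T-windows of a simulated two-state Markov chain:

```python
# Toy plug-in estimate of I(W(-); W(+)) from windowed counts.
# My construction, not the estimator from the talk.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
x = [0]
for _ in range(200_000):                 # 2-state Markov chain, P(stay)=0.9
    x.append(x[-1] if rng.random() < 0.9 else 1 - x[-1])

def predictive_info(x, T):
    """I(past T-window; future T-window) in bits, plug-in estimate."""
    joint = Counter((tuple(x[i - T:i]), tuple(x[i:i + T]))
                    for i in range(T, len(x) - T + 1))
    n = sum(joint.values())
    past, fut = Counter(), Counter()
    for (p, f), c in joint.items():
        past[p] += c
        fut[f] += c
    return sum(c / n * np.log2(c * n / (past[p] * fut[f]))
               for (p, f), c in joint.items())

for T in (1, 2, 4):
    print(T, round(predictive_info(x, T), 4))
```

For this first-order chain the estimate saturates quickly with T; logarithmic or power-law growth would signal a more complex process.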

46
Logarithmic growth for finite dimensional
processes
  • Finite-parameter processes (e.g. Markov
    chains): Ipred(T) grows like (K/2) log T for K
    parameters
  • Similar to stochastic complexity (MDL)

47
Power law growth
  • Such fast growth is a signature of
    infinite-dimensional processes
  • Power laws emerge in cases where the
    interactions/correlations are long-range
48
Entropy of words in a Spin Chain
49
Entropy of 3 Generated Chains
The entropy is extensive: it shows no distinction
between the cases!
50
Predictive Information: the Subextensive
Component of the Entropy
It shows a qualitative distinction between the cases!
The growth of the subextensive component reflects
the underlying complexity!
51
But WHAT - in the past - is predictive?
[Diagram: a process W(t) around t = 0, with a
past T-window W(−) and a future T-window W(+)]
  • Using the Information Bottleneck method
  • Solve MinZ I(W(−); Z) − β I(W(+); Z) for all β > 0
  • The T past-future information curve: ITF(ITP)
  • IFuture(IPast) = limT→∞ ITF(ITP)

52
[Plot: the information curves IT1F(IT1P),
IT2F(IT2P), IT3F(IT3P) in the (IPast, IFuture)
plane]
The limit is always the convex envelope of the
increasing time-windows' information curves
  • Can be calculated analytically for Markov
    chains, Gaussian processes, etc., and
    numerically in general, in a distribution-free
    way.

53
The Information Bottleneck Method
N. Tishby, F.C. Pereira and W. Bialek, 1999.
54
The IB algorithm: generalized Arimoto-Blahut
for RDT
  • The iteration steps (the self-consistent
    equations of Tishby, Pereira and Bialek, 1999):
  • p(z|x) ∝ p(z) exp(−β DKL[p(y|x) || p(y|z)])
  • p(z) = Σx p(x) p(z|x)
  • p(y|z) = Σx p(y|x) p(x|z)
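The equations above are the published self-consistent updates; the following NumPy realization of the iteration is my own sketch (the function, the toy joint distribution, and the choice of β are illustrative):

```python
# Sketch of the IB iterations for a discrete joint p(x, y). The update
# equations follow the slide; the code itself is an illustration.
import numpy as np

def information_bottleneck(p_xy, n_z, beta, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                      # p(x)
    p_y_x = p_xy / p_x[:, None]                 # p(y|x)
    q_z_x = rng.random((len(p_x), n_z))
    q_z_x /= q_z_x.sum(axis=1, keepdims=True)   # random initial p(z|x)
    for _ in range(iters):
        q_z = p_x @ q_z_x                       # p(z) = sum_x p(x) p(z|x)
        q_y_z = (q_z_x * p_x[:, None]).T @ p_y_x / q_z[:, None]   # p(y|z)
        # KL[p(y|x) || p(y|z)] for every (x, z) pair
        ratio = (p_y_x[:, None, :] + 1e-12) / (q_y_z[None, :, :] + 1e-12)
        kl = (p_y_x[:, None, :] * np.log(ratio)).sum(axis=-1)
        q_z_x = q_z[None, :] * np.exp(-beta * kl)   # p(z|x) update
        q_z_x /= q_z_x.sum(axis=1, keepdims=True)   # normalization Z(x, beta)
    return q_z_x, q_y_z

# Toy joint: x-values 0,1 prefer y=0; x-values 2,3 prefer y=1.
p_xy = np.array([[.20, .05], [.18, .07], [.05, .20], [.07, .18]])
p_xy /= p_xy.sum()
q_z_x, q_y_z = information_bottleneck(p_xy, n_z=2, beta=5.0)
print(q_z_x.round(2))   # x's with similar p(y|x) share a cluster z
```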

55
Why is the predictive information interesting?
It determines the level of adaptation of
organisms to stochastic environments
An interesting key example: Human Languages
56
Variable Memory Markov Models and Prediction
Suffix Tree Learning (Ron, Singer, Tishby,
1995-96)
  • Can be learned accurately and efficiently
  • Can capture high-order correlations (better than
    HMM)
  • Effectively capture short features and motifs
    (a toy sketch follows below)
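As a toy illustration of variable-memory prediction (far simpler than the actual Ron-Singer-Tishby learning algorithm; names and thresholds are mine), one can store next-symbol counts per suffix and predict with the longest suffix that was seen often enough:

```python
# Toy sketch of variable-memory (suffix-based) prediction: store next-symbol
# counts per suffix, predict with the longest suffix seen often enough.
from collections import Counter, defaultdict

def fit_suffix_counts(seq, max_depth=4):
    counts = defaultdict(Counter)        # suffix string -> next-symbol counts
    for i in range(len(seq) - 1):
        for d in range(min(max_depth, i + 1) + 1):
            counts[seq[i + 1 - d:i + 1]][seq[i + 1]] += 1
    return counts

def predict(counts, context, max_depth=4, min_count=3):
    for d in range(min(max_depth, len(context)), 0, -1):
        c = counts.get(context[-d:])     # longest reliable suffix wins
        if c and sum(c.values()) >= min_count:
            break
    else:
        c = counts[""]                   # fall back to unigram statistics
    total = sum(c.values())
    return {s: n / total for s, n in c.items()}

seq = "abracadabraabracadabra"
counts = fit_suffix_counts(seq)
print(predict(counts, "abr"))            # probability mass concentrated on 'a'
```

A real prediction suffix tree additionally prunes suffixes whose predictions match their parent's, yielding the compact tree of the paper.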

57
Complexity-Accuracy Tradeoff
[Diagram: accuracy vs. complexity over the space
of possible models/representations]
58
Simplified Chinese 2.09
Traditional Chinese 1.73
Dutch 2.3
French 2.22
Hebrew 1.63
Italian 2.35
Japanese 1.42
Portuguese 2.9
Spanish 1.89
59
Can we understand it?
60
Many Thanks to
  • Bill Bialek
  • Ilya Nemenman
  • Naama Parush
  • Jonathan Rubin
  • Eli Nelken
  • Dmitry Davidov
  • Felix Creutzig
  • Amir Globerson
  • Gal Chechik
  • Roi Weiss