Burst Detection using Shifted Aggregation Tree

Slides: 39

Provided by: xinz7

Learn more at: https://cs.nyu.edu

1
Burst Detection using Shifted Aggregation Tree
2
Content

Burst Detection Problem and Shifted Binary Tree
Aggregation Pyramid and Aggregation Tree
Shifted Aggregation Tree (SAT) and Detection
Algorithm
Search for Best SAT Structure
Experiments and Discussion

3
Aggregation Pyramid (AP)

N-level pyramid-shape data structure built on a
sliding window of length N
Level 1 has N cells containing original data,
from left to right
Level 2 has N-1 cells, storing the aggregates for
2 consecutive data points, i.e., a sliding sub-window of
length 2
Level h has N-h+1 cells, storing the aggregates
for h consecutive data points, i.e., a sliding sub-window
of length h
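
The level-by-level definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the aggregate is a sum; the function name build_aggregation_pyramid is hypothetical and not from the slides.

```python
def build_aggregation_pyramid(window):
    """Return a list of levels; levels[h - 1] holds the sums of every
    h consecutive values of the window (assumes sum as the aggregate)."""
    n = len(window)
    levels = [list(window)]  # level 1: the original data
    for h in range(2, n + 1):
        prev = levels[-1]
        # A length-h aggregate extends the length-(h-1) aggregate that
        # starts at the same position by one more data point.
        levels.append([prev[i] + window[i + h - 1] for i in range(n - h + 1)])
    return levels


pyramid = build_aggregation_pyramid([3, 1, 4, 1, 5])
for h, level in enumerate(pyramid, start=1):
    print(h, level)
```

Level h ends up with N-h+1 cells, matching the definition, and the single top cell is the aggregate of the whole window.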

12-level Aggregation Pyramid
Xin: Lots of double counting. This is not a
generalization of SBT. (Why can't having double
counting be a generalization? A SAT could
be worse than SBT.)
4
Embed SBT in AP

Each node in the Shifted Binary Tree (SBT) is either an
original data point or an aggregate, i.e., a cell in the
Aggregation Pyramid, so SBT can be embedded in AP.
The figure shows how each SBT node is embedded in
the AP cell with the same color.

Embed SBT in AP
5
Aggregation Pyramid as Host Structure

Any structure composed of only aggregates and the
original data can be embedded in AP.
It provides a host structure and a tool to
visualize and compare different data structures.

Another Embedded Structure
6
Aggregation Tree

A tree relation defined on a subset of the cells of
the Aggregation Pyramid that contains all the
first-level cells
For any cell(t,h) in its domain, if cell(t,h)
aggregates cell(t1,h1), cell(t2,h2), ..., cell(tn,hn),
then cell(t,h) is the parent, cell(t1,h1) is the
first child, ..., and cell(tn,hn) is the nth child.
Call cell(t,h) node(t,k) if cell(t,h) is in
the kth layer of the aggregation tree. Node(t,k)
has height h in the aggregation pyramid and
thus corresponds to a window of length h.

7
Notations
The overlap of node(12,3) and node(16,3), i.e.
cell(12,12) and cell(16,12), is cell(12,9), the
dark gray cell.
8
Shifted Aggregation Tree (SAT)

Aggregation Tree w/ the following constraints
(SAT Constraints)
Nodes in the same layer have the same sub-tree
structure, i.e., if node(t,h) aggregates
node(t1,h1), node(t2,h2), etc., then node(t+s,h)
aggregates node(t1+s,h1), node(t2+s,h2), and so
on. All the children's window lengths and
relative orders in time are the same; they are only
shifted in time.
Every node in the same layer is shifted by the same
duration in time from its previous node.
The shift at layer l is a multiple of the shift
at layer l-1.
The overlap window length of two adjacent nodes
at layer l is greater than or equal to the window
length at layer l-1

9
SAT generalizes SBT
10
SAT Examples
Xin: Does this generalize? (Yes)
11
SAT Representation

Layer k can be represented by a tuple (h_k, s_k,
c_k1, c_k2, ..., c_kn), where
h_k is the corresponding level h in the Aggregation
Pyramid
s_k is the shift, i.e., the distance at layer 1 from the
leftmost leaf of a node to the leftmost leaf of the next
node at the same layer
c_k1, c_k2, ..., c_kn are the layers of all its children
(h_k, s_k) for short when there is no confusion.
A SAT can be represented by a list of (h_k, s_k) pairs;
the first pair represents the first layer, and
so on.
For example, (1,1), (2,1), (4,2), (8,4)
represents a SWT of height 8

12
SAT Properties

The overlap window length ho_k of two adjacent
nodes at layer k is h_k - s_k
A window length h between ho_{k-1} + 2 and ho_k + 1 is
always covered by a node at layer k
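
The two properties above can be checked on a concrete (h, s) list. The following is a minimal sketch; the function name layer_coverage is hypothetical, and treating layer 1 as starting at window length 1 is an assumption, since the slides state the property only relative to the layer below.

```python
def layer_coverage(sat):
    """For each layer (h, s), return (overlap, shortest, longest): the
    overlap window length h - s and the range of window lengths that the
    layer is responsible for, ho_{k-1} + 2 .. ho_k + 1."""
    result = []
    prev_overlap = -1  # assumption: makes layer 1 start at window length 1
    for h, s in sat:
        overlap = h - s
        result.append((overlap, prev_overlap + 2, overlap + 1))
        prev_overlap = overlap
    return result


sbt = [(1, 1), (2, 1), (4, 2), (8, 4)]
for layer, row in enumerate(layer_coverage(sbt), start=1):
    print(layer, row)
```

For this list the top layer (8, 4) has overlap 4 and therefore covers window lengths 4 and 5.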

13
SAT Benefits

Structure not unique, defining a family of
structures.
Variable (h,s) means variable bounding ratio T,
controllable false alarm ratio and controllable
detail search region.
Adaptive to the data!

14
Detection Algorithm w/ SAT

For each time point t,
  start from layer 2, i.e., k = 2
  while (a window ends) // need to update node(t,k) (1)
    update node(t,k) (2)
    if node(t,k) exceeds the threshold of some length
    h in its Detail Search Region DSR(t,k) (3)
      detail search DSR(t,k) (4)
      (Note: needs proof that you want to do this now.)
    end
    go to next layer, k = k + 1
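
The loop above can be made concrete as follows. This is a sketch under stated assumptions, not the presenter's implementation: the aggregate is a sum over non-negative data, node values are read from prefix sums instead of being aggregated from children, the detail search is the naïve one over the loose DSR, layer 1 is checked too so that length-1 windows are covered, and each layer is tested independently for a window ending at t. The name detect_bursts and the thresholds dictionary are hypothetical.

```python
from itertools import accumulate


def detect_bursts(data, sat, thresholds):
    """Return the set of (end_time, window_length) cells whose sum reaches
    thresholds[window_length]. Times are 1-indexed; sat is a list of (h, s)."""
    prefix = [0] + list(accumulate(data))

    def cell(t, h):
        # Sum of the h values ending at time t (shortcut for a pyramid cell).
        return prefix[t] - prefix[t - h]

    # Window lengths handled by each layer: ho_{k-1} + 2 .. ho_k + 1.
    lengths_per_layer = []
    prev_overlap = -1
    for h, s in sat:
        lengths_per_layer.append(
            [w for w in range(prev_overlap + 2, h - s + 2) if w in thresholds])
        prev_overlap = h - s

    last_end = [0] * len(sat)  # last ending time already searched, per layer
    bursts = set()
    for t in range(1, len(data) + 1):
        for k, (h, s) in enumerate(sat):
            if t < h or (t - h) % s != 0:
                continue  # no window of this layer ends at time t
            node = cell(t, h)  # step (2): update node(t, k)
            lengths = lengths_per_layer[k]
            # Step (3): filter against the smallest threshold in the DSR.
            if lengths and node >= min(thresholds[w] for w in lengths):
                # Step (4): naive detail search of the loose DSR.
                for w in lengths:
                    for e in range(max(w, last_end[k] + 1), t + 1):
                        if cell(e, w) >= thresholds[w]:
                            bursts.add((e, w))
            last_end[k] = t
    return bursts


data = [1, 0, 2, 1, 0, 9, 8, 1, 0, 1, 2, 0, 1, 7, 0, 1]
sbt = [(1, 1), (2, 1), (4, 2), (8, 4)]
thresholds = {1: 8, 2: 12, 3: 15, 4: 17, 5: 18}
print(sorted(detect_bursts(data, sbt, thresholds)))
```

The filter in step (3) is safe here only because the data are non-negative, so a node's sum is never smaller than the sum of any window it contains. Windows ending after the last node of a layer are not searched in this sketch; the example length 16 is chosen so that every layer has a node ending at the last time point.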

15
Detail Search Region (DSR)

DSR(t,k): the region in which to do a detail search
for real bursts when node(t,k) is updated
A set of cells in the aggregation pyramid within
node(t,k)'s coverage
Excludes cells before t-s_k, which have been
searched by node(t-s_k,k), where s_k is the shift
at layer k
Excludes cells within DSR(t,k-1), which have been
searched by node(t,k-1)
Xin: I don't understand this one. This may look
for a different threshold. (Correct: you can do a
detail search over the whole coverage, but it is not
necessary where some part can be covered by lower nodes.)
Loose DSR vs Tight DSR

16
Loose DSR

ho_k: the overlap window length of 2 adjacent
nodes at layer k
A window length h between ho_{k-1} + 2 and ho_k + 1 is
always covered by a node at layer k
A quadrilateral shape bounded by t-s_k+1, t and
h_{k-1}-s_{k-1}+2, h_k-s_k+1
Used by SBT
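
The bounds of the loose DSR can be enumerated directly. A minimal sketch, assuming cells are identified by (ending time, window length) and layers are numbered from 1; the function name loose_dsr is hypothetical.

```python
def loose_dsr(t, k, sat):
    """Cells (end_time, window_length) in the loose DSR of node(t, k).
    sat is a list of (h, s); k is the 1-based layer number, k >= 2."""
    h_k, s_k = sat[k - 1]
    h_prev, s_prev = sat[k - 2]
    lengths = range(h_prev - s_prev + 2, h_k - s_k + 2)  # ho_{k-1}+2 .. ho_k+1
    ends = range(t - s_k + 1, t + 1)                     # t-s_k+1 .. t
    return [(e, w) for w in lengths for e in ends]


sbt = [(1, 1), (2, 1), (4, 2), (8, 4)]
cells = loose_dsr(12, 4, sbt)
print(cells)
```

For the layer (8, 4) node ending at t = 12, this gives window lengths 4 and 5 at ending times 9 through 12, and every such window starts no earlier than the node itself (time 5).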

17
Tight DSR

A window length h between ho_{k-1} + 2 and h_k could
be covered by a window of length h_k, depending on the
ending time
cell(t-1,h_k-1) is covered by node(t,k), but
cell(t-2,h_k-1) is not
An m-ary fork shape, m = s_k/s_{k-1}
For SBT, m = 2, i.e., an L shape
Has the same probability of raising an alarm as
Loose DSR, but has less detail search cost on
average, since some cells will be detected by a
tight bound, especially with bound-and-prune detail
search.

18
Naïve Detail Search

Search every cell in DSR one by one.
cell(t+1,h+1) / cell(t-1,h-1) can be computed by
adding cell(t+1,1) to / removing cell(t,1) from
cell(t,h).
Starting from one seed cell, populate the whole
DSR of interest.
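
The incremental step can be written out as two small helpers. A minimal sketch, assuming a sum aggregate and a 0-indexed Python list holding data for times 1..N; the names extend_cell and shrink_cell are hypothetical.

```python
def extend_cell(data, cell):
    """cell(t+1, h+1) from cell(t, h): add the data point at time t+1."""
    t, h, value = cell
    return (t + 1, h + 1, value + data[t])  # data[t] is time t+1


def shrink_cell(data, cell):
    """cell(t-1, h-1) from cell(t, h): remove the data point at time t."""
    t, h, value = cell
    return (t - 1, h - 1, value - data[t - 1])  # data[t-1] is time t


data = [3, 1, 4, 1, 5, 9]
seed = (4, 3, sum(data[1:4]))  # cell(4,3): times 2..4
print(extend_cell(data, seed), shrink_cell(data, seed))
```

Both neighbours keep the same starting time as the seed cell, so repeated application walks along one diagonal of the pyramid.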

19
Grid-based Bound-and-Prune Detail Search

Given node(t,k), i.e., cell(t,hk), there are
several cells at lower layers having the same
starting time or the same ending time.
By subtracting these cells, we can get some
intermediate cells within DSR.
These cells form a grid and partition DSR.
Each cell has its own small DSR; if it doesn't
exceed the minimal threshold, there is no need to check
its DSR.
Binary-partition the DSR: check the big DSR first, then
the small DSRs.

20
Grid-based Bound-and-Prune Detail Search

By subtracting a lower layer cell starting at the
same time on the left, we can get a cell with the
same color on the right.

The intermediate cells partition the L-shaped
tight DSR bounded by the red lines
cell(28,20) has its DSR bounded by the pink lines

21
Algorithm Complexity

Depends on the specific SAT structure and the data to
be detected
(Should have an analysis in the average case for
the best SAT structure for the given data)

22
Search for Best SAT Structure

Given the above detection algorithm and the data
to be detected, the best SAT structure minimizes
the total running time.
Two major parts in the detection algorithm:
updating the SAT structure, steps (1) and (2)
detail searching the DSR, steps (3) and (4)
Best SAT structure balances between the updating
time and the searching time.

23
Optimization Goal

The goal is to minimize the total running time, i.e.,
the updating time plus the searching time.
But how to quantify the updating time and the
searching time?
Theoretical method: estimate the number of
cell-accesses
Experimental method: measure machine time on sample
data

24
Estimate time by Number of Cell-access

Most parts of the algorithm, i.e., steps (2), (3), and (4),
access either the original data or an
aggregate, which is the same type as the original
data.
The number of cell-accesses indicates how many
operations are executed, and thus gives an estimate of
the running time.
Use the expected number of cell-accesses as the
measure of the expected running time.
Step (1) is executed the same number of times as step
(2), so it is counted in step (2) by multiplying by a weight
learned from experiments.

25
Cell-access of node(t,l)

The updating step (2) needs to read all of the node's
children and write the aggregate to node(t,l),
thus the number of cell-accesses is
sizeof(children) + 1
Step (3) can be done by a binary search that
compares node(t,l) against the thresholds of interest
within the DSR, requiring log2(W) + 1
comparisons, where W is the number of window lengths
of interest in this DSR.
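
As a small worked example of the two counts above, assuming the comparison count reads log2(W) + 1 (the function names are hypothetical):

```python
from math import log2


def update_cell_accesses(num_children):
    """Step (2): read every child, then write the aggregate once."""
    return num_children + 1


def filter_comparisons(num_window_lengths):
    """Step (3): binary search over W thresholds, log2(W) + 1 comparisons."""
    return log2(num_window_lengths) + 1


print(update_cell_accesses(2), filter_comparisons(8))
```

A node with two children therefore costs 3 cell-accesses to update, and a DSR with 8 window lengths of interest costs 4 comparisons to filter.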

26
Cell-access of node(t,l) (2)

For naïve detail search, checking one cell
requires 4 cell-accesses (2 reads, 1 write, 1
comparison against the threshold).
The probability of searching a cell(t,h) is the
probability of raising an alarm at level h given
its covering node. Prob_alarm can be learned from
sample data.
Total number of cell-access is

27
Time Cost of a SAT

Cost(SAT)
For theoretical method
For the experimental method: actual machine time to
test on sample data with this SAT
Normalized Cost(SAT)
where t is the total number of time points when counting
or testing, and h_L is the window length of the top
layer
A comparable value for different SATs with different
time cycles and different max window lengths
The best SAT is the one with minimal normalized cost
that can cover the max window length of interest,
N.

28
H-S grid as a search space

Map layer (h, s) to a grid cell in a regular h-s
grid, with origin at (1,1).
Linking these grid cells in the SAT list order, a
SAT can be mapped to a path in the h-s grid.
Two SAT paths are shown:
(1,1)(3,3)(6,3)(12,3) in red
(1,1)(2,2)(4,2)(8,4)(12,4) in blue
Grayed cells don't satisfy the SAT constraints,
because h < s

h-s grid and 2 SAT paths
29
Best SAT as shortest path

Any path ending above the line h-s+1=N can at
least cover window length N, thus there is no need for
another layer.
Call the grid cells above the line h-s+1=N
final states. The best SAT is the shortest path
starting from (1,1) to one of the final states
and satisfying the SAT constraints.
Multiple Single-Source-Shortest-Path (SSSP)
problems between (1,1) and one of the final states in
a directed graph.

30
SSSP w/ Constraints

Not all the paths between (1,1) and a final state
are legal, i.e., satisfy the SAT constraints.
Unlimited final states: when to stop?
The graph is large if all the possible edges
between layer l-1 and layer l are considered, and many
of them are unlikely even in a good SAT; say,
(1,1)(900,900) is unlikely in a SAT covering
length 1000.

31
Best-first Graph Search

Best-first search in a dynamically-generated
directed graph
Dynamically add edges to the graph for the node
with best normalized cost
Guarantee all the paths are legal
Avoid generating the unlikely edges
Stop searching after reaching M final states.
Dijkstra-style cost update to track the shortest
path
If the cost of a new path ending at current node
is better than the cost kept in this node, update
this node with the new cost

32
Search Algorithm

insert (1,1) into the graph and a heap based on
the normalized cost
while the heap is not empty
  pop the first node in the heap
  if it's a final node
    if it's already in the graph
      if the new cost is better than the cost stored,
      update the cost
    else insert the node into the graph and store its
    cost
    if already reached M final nodes, stop
  else
    generate a set of next possible nodes for this
    node
    for each next possible node
      evaluate the normalized cost
      if the new node is in the graph
        if the new cost is better than the cost stored,
        update the cost
      else insert the node into the graph and store its
      cost
      insert this node into the heap
end while
output the best node and corresponding path with
the best cost
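
The search can be sketched with a standard heap-based best-first loop. This is an illustration under stated assumptions rather than the presenter's code: it orders the heap by an additive path cost (plain Dijkstra) instead of the normalized cost, and the successor generator toy_next_layers and the cost function toy_layer_cost are hypothetical placeholders for the legal-edge generation and the cost model.

```python
import heapq


def best_first_sat(max_window, next_layers, layer_cost, max_finals=1):
    """Search the h-s grid from (1, 1) for a cheapest path to a final
    state, i.e., a layer (h, s) with h - s + 1 >= max_window."""
    start = (1, 1)
    best = {start: 0.0}  # best known cost of a path ending at each node
    parent = {start: None}
    heap = [(0.0, start)]
    finals = []
    while heap and len(finals) < max_finals:
        cost, node = heapq.heappop(heap)
        if cost > best[node]:
            continue  # stale heap entry: a cheaper path was found later
        h, s = node
        if h - s + 1 >= max_window:
            finals.append((cost, node))  # final state, do not expand
            continue
        for nxt in next_layers(node):
            new_cost = cost + layer_cost(node, nxt)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost  # Dijkstra-style cost update
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if not finals:
        return None
    cost, node = min(finals)
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return cost, path[::-1]


def toy_next_layers(node):
    # Hypothetical edge rule: the new shift is a multiple of the old one
    # and the new overlap h2 - s2 is at least the old window length h.
    h, s = node
    for h2 in range(h + 1, 2 * h + 1):
        for s2 in range(s, h2 - h + 1, s):
            yield (h2, s2)


def toy_layer_cost(node, nxt):
    # Hypothetical cost: a larger shift means fewer updates per time point.
    h2, s2 = nxt
    return 1.0 / s2 + 0.1


print(best_first_sat(8, toy_next_layers, toy_layer_cost))
```

Generating successors lazily from the popped node is what keeps the graph small: edges such as (1,1)(900,900) are never created unless the cost model makes them competitive.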

33
Experiments

Test data: random, normally distributed N(6,1), 1M
Sample data: 20K
Windows of interest are all the windows from
length l up to the max length N
For window length h, the threshold T_h is set to
where bp is the real burst probability and Φ is the
normal cumulative function
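
The threshold formula itself did not survive in the transcript. One plausible reading, offered only as an assumption, is that the threshold is chosen so that the sum of h independent N(6,1) values exceeds it with probability bp, using the inverse of the normal cumulative function. The function name burst_threshold and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def burst_threshold(h, bp, mu=6.0, sigma=1.0):
    """Assumed form: the sum of h i.i.d. N(mu, sigma^2) values is
    N(h*mu, h*sigma^2), so this threshold is exceeded with probability bp."""
    return h * mu + sqrt(h) * sigma * NormalDist().inv_cdf(1.0 - bp)


for h in (1, 4, 16):
    print(h, burst_threshold(h, 1e-4))
```

Under this reading a smaller burst probability pushes the threshold further above the expected sum h*mu.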

34
SAT w/ naïve detail search vs SBT

Total running time in machine clock
SAT_EP: SAT found using the experimental time cost
SAT_TH: SAT found using the theoretical time cost
SBT: Shifted Binary Tree

35
SAT w/ naïve detail search vs SBT (2)
36
Alarm probability is always large

In the testing data, even when the real burst
probability Prob_true is 0.0001, the probability
that a window of length l exceeds the threshold for
length l-d increases rapidly even when d increases
only a little.

The figure on the right shows this probability for all the
window length pairs less than 100 with
Prob_true = 0.0001.
When Prob_true is considerably large, the updating
time decides the SAT structure, since the
detecting times are very close.

37
Discussion

SAT_EP is sensitive to CPU time, since it depends
on the testing time on a small amount of sample
data. Different runs may give different SAT
structures.
It's stable in the sense that it always finds some SAT
better than SBT.
SAT_TH doesn't give an accurate estimate of the
actual running time.
But when Prob_alarm is large, it produces a better
comparison of updating cost between different
SATs, and thus a better result.

38
SAT w/ naïve detail search vs SBT (3)