Burst Detection using Shifted Aggregation Tree - PowerPoint PPT Presentation

Slides: 39
Provided by: xinz7
Learn more at: https://cs.nyu.edu
Transcript and Presenter's Notes

1
Burst Detection using Shifted Aggregation Tree
2
Content
  • Burst Detection Problem and Shifted Binary Tree
  • Aggregation Pyramid and Aggregation Tree
  • Shifted Aggregation Tree (SAT) and Detection
    Algorithm
  • Search for Best SAT Structure
  • Experiments and Discussion

3
Aggregation Pyramid (AP)
  • N-level pyramid-shape data structure built on a
    sliding window of length N
  • Level 1 has N cells containing original data,
    from left to right
  • Level 2 has N-1 cells, storing the aggregates for
    2 consecutive data points, i.e., a sliding
    sub-window of length 2
  • Level h has N-h+1 cells, storing the aggregates
    for h consecutive data points, i.e., a sliding
    sub-window of length h
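
The level-by-level definition above can be sketched in a few lines of Python. This is only an illustration, assuming the aggregate is a sum; the function name aggregation_pyramid is hypothetical and does not come from the slides.

```python
def aggregation_pyramid(window):
    """Build all levels of an aggregation pyramid over a window (sum aggregate).

    levels[h - 1] holds level h: N - h + 1 cells, each the sum of h
    consecutive data points, ordered from left to right.
    """
    n = len(window)
    levels = [list(window)]  # level 1: the original data
    for h in range(2, n + 1):
        prev = levels[-1]
        # A length-h window is the length-(h-1) window starting at the
        # same position plus the next raw data point.
        levels.append([prev[i] + window[i + h - 1] for i in range(n - h + 1)])
    return levels


pyramid = aggregation_pyramid([1, 2, 3, 4])
print(pyramid)
```

Building every level costs O(N^2) cells, which is why the later slides keep only a subset of the cells.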

12-level Aggregation Pyramid
Xin: Lots of double counting. This is not a
generalization of SBT. (Why can't a structure with
double counting be a generalization? A SAT could
be worse than SBT.)
4
Embed SBT in AP
  • Each node in a Shifted Binary Tree is either an
    original data point or an aggregate, and each of
    these is a cell in the Aggregation Pyramid, so
    SBT can be embedded in AP.
  • The figure shows how each cell in SBT is embedded
    in the cell with the same color in AP.

Embed SBT in AP
5
Aggregation Pyramid as Host Structure
  • Any structure composed of only aggregates and the
    original data can be embedded in AP.
  • It provides a host structure and a tool to
    visualize and compare different data structures.

Another Embedded Structure
6
Aggregation Tree
  • A tree relation defined on a subset of the cells
    of the Aggregation Pyramid that contains all the
    first-level cells
  • For any cell(t,h) in its domain, if cell(t,h)
    aggregates cell(t1,h1), cell(t2,h2), ...,
    cell(tn,hn), then cell(t,h) is the parent,
    cell(t1,h1) is the first child, ..., and
    cell(tn,hn) is the nth child.
  • Call cell(t,h) node(t,k) if cell(t,h) is in
    the kth layer of the aggregation tree. Node(t,k)
    has height h in the aggregation pyramid and
    thus corresponds to a window of length h.

7
Notations
The overlap of node(12,3) and node(16,3), i.e.
cell(12,12) and cell(16,12), is cell(12,9), the
dark gray cell.
8
Shifted Aggregation Tree (SAT)
  • Aggregation Tree w/ the following constraints
    (SAT Constraints)
  • Nodes in the same layer have the same sub-tree
    structure, i.e., if node(t,h) aggregates
    node(t1,h1), node(t2,h2), etc., then node(t+s,h)
    aggregates node(t1+s,h1), node(t2+s,h2), and so
    on. All the children's window lengths and
    relative orders in time are the same; they are
    only shifted in time.
  • Every node in the same layer shifts the same
    duration in time from its previous node.
  • The shift at layer l is a multiple of the shift
    at layer l-1.
  • The overlap window length of two adjacent nodes
    at layer l is greater than or equal to the window
    length at layer l-1

9
SAT generalizes SBT
10
SAT Examples
Xin: does this generalize? (Yes)
11
SAT Representation
  • Layer k can be represented by a tuple (h_k, s_k,
    c_k1, c_k2, ..., c_kn), where
  • h_k is the corresponding level h in the
    Aggregation Pyramid
  • s_k is the shift: the distance at layer 1 from
    the leftmost leaf of a node to the leftmost leaf
    of the next node at the same level.
  • c_k1, c_k2, ..., c_kn are the layers of all its
    children
  • (h_k, s_k) for short when there is no confusion.
  • A SAT can be represented by a list of (h_k, s_k)
    pairs; the first pair represents the first layer,
    and so on.
  • For example, (1,1), (2,1), (4,2), (8,4)
    represents a SWT of height 8

12
SAT Properties
  • The overlap window length ho_k of two adjacent
    nodes at layer k is h_k - s_k
  • A window length h between ho_{k-1}+2 and ho_k+1
    is always covered by a node at layer k
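
As a small worked example of the two properties above, the following sketch computes, for each layer of a SAT given as a list of (h, s) pairs, the range of window lengths that layer is guaranteed to cover. The function name covered_window_lengths is hypothetical, and treating the overlap "below layer 1" as -1 (so that layer 1 covers window length 1) is an assumption made only to start the recurrence.

```python
def covered_window_lengths(sat):
    """Return (shortest, longest) guaranteed-covered window length per layer.

    sat is a list of (h, s) pairs, first layer first. Layer k covers
    window lengths from ho_{k-1} + 2 up to ho_k + 1, where ho_k = h_k - s_k.
    """
    ranges = []
    prev_overlap = -1  # assumption: makes layer 1 start at window length 1
    for h, s in sat:
        overlap = h - s  # overlap of two adjacent nodes in this layer
        ranges.append((prev_overlap + 2, overlap + 1))
        prev_overlap = overlap
    return ranges


# The SBT example from the SAT Representation slide.
print(covered_window_lengths([(1, 1), (2, 1), (4, 2), (8, 4)]))
# → [(1, 1), (2, 2), (3, 3), (4, 5)]
```

If a layer's overlap is no larger than that of the layer below, its range comes out empty (shortest greater than longest), meaning the layer adds no newly covered window length.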

13
SAT Benefits
  • Structure not unique, defining a family of
    structures.
  • Variable (h,s) means variable bounding ratio T,
    controllable false alarm ratio and controllable
    detail search region.
  • Adaptive to the data!

14
Detection Algorithm w/ SAT
  • For each time point t:
    • start from layer 2, i.e., k = 2
    • while (a window ends at layer k) // need to
      update node(t,k) (1)
      • update node(t,k) (2)
      • if node(t,k) exceeds the threshold of some
        length h in its Detail Search Region
        DSR(t,k) (3)
        • detail search DSR(t,k) (4)
          Note: Needs proof that you want to do this
          now.
      • end
      • go to the next layer, k = k + 1
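
The loop above can be sketched in Python as follows. This is a simplified illustration, not the presenter's implementation: it assumes a sum aggregate over non-negative data, assumes layer 1 is (1,1), assumes nodes at layer k end at multiples of s_k, computes node values from prefix sums instead of from children, and uses a loose rectangular search region with naive detail search. The name detect_bursts and the thresholds dictionary are hypothetical.

```python
from itertools import accumulate


def detect_bursts(data, sat, thresholds):
    """Return {(end_time, length)} of windows whose sum meets the threshold.

    data: non-negative numbers; sat: list of (h, s) pairs, layer 1 first;
    thresholds: {window_length: threshold}. End times are 1-indexed.
    """
    prefix = [0] + list(accumulate(data))

    def window_sum(end, length):
        # Sum of the `length` points ending at time `end`, clipped at the start.
        return prefix[end] - prefix[max(end - length, 0)]

    overlaps = [h - s for h, s in sat]
    bursts = set()
    for t in range(1, len(data) + 1):
        # Layer 1 holds the raw data: check window length 1 directly.
        if 1 in thresholds and data[t - 1] >= thresholds[1]:
            bursts.add((t, 1))
        for k in range(1, len(sat)):  # index k is layer k + 1
            h, s = sat[k]
            if t % s != 0:
                break  # no window ends here; higher shifts are multiples of s
            node = window_sum(t, h)  # step (2): update node(t, k)
            lengths = [w for w in range(overlaps[k - 1] + 2, overlaps[k] + 2)
                       if w in thresholds]
            # Step (3): the node bounds every window it covers from above.
            if not lengths or node < min(thresholds[w] for w in lengths):
                continue
            # Step (4): naive detail search over the uncovered end times.
            for end in range(t - s + 1, t + 1):
                for w in lengths:
                    if end >= w and window_sum(end, w) >= thresholds[w]:
                        bursts.add((end, w))
    return bursts


data = [(i * 7) % 10 for i in range(64)]
sat = [(1, 1), (2, 1), (4, 2), (8, 4)]
thresholds = {w: 7 * w for w in range(1, 6)}
print(sorted(detect_bursts(data, sat, thresholds))[:5])
```

In this sketch, windows that end after the last node update at some layer are not examined, so the stream length should be a multiple of the top-layer shift (or the stream should be padded) for complete results.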

15
Detail Search Region (DSR)
  • DSR(t,k): the region in which to do a detail
    search for real bursts when node(t,k) is updated
  • A set of cells in the aggregation pyramid within
    node(t,k)'s coverage
  • Exclude cells before t-s_k, which have been
    searched by node(t-s_k,k), where s_k is the shift
    at layer k
  • Exclude cells within DSR(t,k-1), which have been
    searched by node(t,k-1)
    Xin: I don't understand this one. This may look
    for a different threshold. (Correct: you can do
    detail search over the whole coverage; it is just
    not necessary if some part can be covered by
    lower nodes.)
  • Loose DSR vs. Tight DSR

16
Loose DSR
  • ho_k: the overlap window length of 2 adjacent
    nodes at layer k
  • A window length h between ho_{k-1}+2 and ho_k+1
    is always covered by a node at layer k
  • A quadrilateral shape bounded by t-s_k+1 and t in
    time, and by h_{k-1}-s_{k-1}+2 and h_k-s_k+1 in
    window length
  • Used by SBT

17
Tight DSR
  • A window length h between ho_{k-1}+2 and h_k
    could be covered by a window of length h_k,
    depending on the ending time
  • cell(t-1, h_k - 1) is covered by node(t,k), but
    cell(t-2, h_k - 1) is not
  • An m-ary fork shape, where m = s_k/s_{k-1}
  • For SBT, m = 2, i.e., an L shape
  • Has the same probability of raising an alarm as
    Loose DSR, but a lower detail search cost on
    average, since some cells will be detected by a
    tight bound, especially with bound-and-prune
    detail search.

18
Naïve Detail Search
  • Search every cell in DSR one by one.
  • cell(t+1,h+1) can be computed by adding
    cell(t+1,1) to cell(t,h); cell(t-1,h-1) can be
    computed by removing cell(t,1) from cell(t,h).
  • Starting from one seed cell, populate the whole
    DSR of interest.
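
The two incremental moves above can be written out directly. This is a minimal sketch assuming a sum aggregate, cells indexed by ending time t (1-indexed) and window length h, and raw data held in a 0-indexed Python list; the names grow and shrink are hypothetical.

```python
def grow(value, data, t):
    """cell(t, h) -> cell(t+1, h+1): add the raw point at time t+1."""
    return value + data[t]  # data[t] is the point at time t+1 (0-indexed list)


def shrink(value, data, t):
    """cell(t, h) -> cell(t-1, h-1): remove the raw point at time t."""
    return value - data[t - 1]  # data[t-1] is the point at time t


data = [3, 1, 4, 1, 5]
seed = data[1] + data[2]      # cell(3, 2): the 2 points ending at time 3
print(grow(seed, data, 3))    # cell(4, 3)
print(shrink(seed, data, 3))  # cell(2, 1)
```

Each move costs one raw-data read, one read of the neighboring cell, and one write, which matches the per-cell accounting used later for the naive detail search.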

19
Grid-based Bound-and-Prune Detail Search
  • Given node(t,k), i.e., cell(t,h_k), there are
    several cells at lower layers that have the same
    starting time or the same ending time.
  • By subtracting these cells, we can get some
    intermediate cells within the DSR.
  • These cells form a grid and partition the DSR.
  • Each cell has its own small DSR; if the cell
    doesn't exceed the minimal threshold, there is no
    need to check its DSR.
  • Partition the DSR in binary fashion: check the
    big DSR first, then the small DSRs.

20
Grid-based Bound-and-Prune Detail Search
  • By subtracting a lower layer cell starting at the
    same time on the left, we can get a cell with the
    same color on the right.
  • The intermediate cells partition the L-shaped
    tight DSR bounded by the red lines
  • cell(28,20) has its DSR bounded by the pink lines

21
Algorithm Complexity
  • Depends on the specific SAT structure and the
    data to be detected
  • (Should have an analysis in the average case for
    the best SAT structure for the given data)

22
Search for Best SAT Structure
  • Given the above detection algorithm and the data
    to be detected, the best SAT structure minimizes
    the total running time.
  • Two major parts in the detection algorithm:
    • updating the SAT structure, steps (1) and (2)
    • detail searching the DSR, steps (3) and (4)
  • Best SAT structure balances between the updating
    time and the searching time.

23
Optimization Goal
  • The goal is to minimize the total running time,
    i.e., updating time plus searching time
  • But how to quantify the updating time and the
    searching time?
  • Theoretical method: estimate the number of cell
    accesses
  • Experimental method: measure machine time on
    sample data

24
Estimate time by Number of Cell-access
  • Most of the algorithm, i.e., steps (2), (3), and
    (4), consists of accessing either the original
    data or an aggregate of the same type as the
    original data.
  • The number of cell accesses indicates how many
    operations are executed, and is thus an estimate
    of the running time.
  • Use the expected number of cell accesses as the
    measure of the expected running time.
  • Step (1) executes the same number of times as
    step (2), so it is counted within step (2) by
    multiplying by a weight learned from experiment.

25
Cell-access of node(t,l)
  • The updating step (2) requires reading all of the
    node's children and writing the aggregate to
    node(t,l), thus the number of cell accesses is
    sizeof(children) + 1
  • Step (3) can be done by a binary search that
    compares node(t,l) against the thresholds of
    interest within the DSR, requiring log2(W) + 1
    comparisons, where W is the number of window
    lengths of interest in this DSR.
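
A hedged sketch of these two counts: it assumes the thresholds within a DSR are non-decreasing in window length, so that a binary search (here Python's bisect module) can find every threshold the node meets or exceeds. The function names update_cost and exceeded_lengths are hypothetical.

```python
import bisect


def update_cost(children):
    """Cell accesses for step (2): read every child, write the node."""
    return len(children) + 1


def exceeded_lengths(node_value, lengths, thresholds):
    """Window lengths whose threshold the node value meets or exceeds.

    lengths is sorted ascending and thresholds[i] belongs to lengths[i];
    thresholds must be non-decreasing for the binary search to be valid.
    """
    count = bisect.bisect_right(thresholds, node_value)
    return lengths[:count]


print(update_cost([5, 7]))                                 # → 3
print(exceeded_lengths(26, [4, 5, 6], [20, 25, 30]))       # → [4, 5]
```

An empty result means the node exceeds no threshold in its DSR, so the detail search in step (4) can be skipped.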

26
Cell-access of node(t,l) (2)
  • For naïve detail search, checking one cell
    requires 4 cell accesses (2 reads, 1 write, 1
    comparison against the threshold).
  • The probability of searching a cell(t,h) is the
    probability of raising an alarm at level h given
    its covering node. Prob_alarm can be learned from
    sample data.
  • Total number of cell accesses is

27
Time Cost of a SAT
  • Cost(SAT)
  • For the theoretical method
  • For the experimental method: actual machine time
    measured on sample data with this SAT
  • Normalized Cost(SAT)
  • where t is the total number of time points when
    counting or testing, and h_L is the window length
    of the top layer
  • Gives a comparable value for different SATs with
    different time cycles and different max window
    lengths
  • The best SAT is the one with minimal normalized
    cost that can cover the max window length of
    interest, N.

28
H-S grid as a search space
  • Map layer (h, s) to a grid cell in a regular h-s
    grid, with origin at (1,1).
  • By linking these grid cells in SAT list order, a
    SAT can be mapped to a path in the h-s grid.
  • Two SAT paths are shown:
  • (1,1)(3,3)(6,3)(12,3) in red
  • (1,1)(2,2)(4,2)(8,4)(12,4) in blue
  • Grayed cells don't satisfy the SAT constraints,
    because h < s

h-s grid and 2 SAT paths
29
Best SAT as shortest path
  • Any path ending above the line h-s+1 = N can
    cover at least window length N, so there is no
    need for another layer.
  • Call the grid cells above the line h-s+1 = N
    final states. The best SAT is the shortest path
    from (1,1) to one of the final states that
    satisfies the SAT constraints.
  • Multiple Single-Source-Shortest-Path (SSSP)
    problems between (1,1) and one of the final
    states in a directed graph.

30
SSSP w/ Constraints
  • Not all the paths between (1,1) and a final state
    are legal, i.e., satisfy the SAT constraints.
  • Unlimited final states: when to stop?
  • The graph is large if all the possible edges
    between layer l-1 and layer l are considered, and
    many of them are unlikely to appear even in a
    good SAT; for example, (1,1)(900,900) is unlikely
    in a SAT covering length 1000.

31
Best-first Graph Search
  • Best-first search in a dynamically generated
    directed graph
  • Dynamically add edges to the graph for the node
    with the best normalized cost
  • Guarantees that all the paths are legal
  • Avoids generating the unlikely edges
  • Stop searching after reaching M final states.
  • Dijkstra-style cost update to track the shortest
    path
  • If the cost of a new path ending at the current
    node is better than the cost kept in this node,
    update this node with the new cost
32
Search Algorithm
  • insert (1,1) into the graph and a heap based on
    the normalized cost
  • while the heap is not empty
    • pop the first node in the heap
    • if it's a final node
      • if it's already in the graph
        • if the new cost is better than the cost
          stored, update the cost
      • else insert the node into the graph and store
        its cost
      • if M final nodes have already been reached,
        stop
    • else
      • generate a set of next possible nodes for
        this node
      • for each next possible node
        • evaluate the normalized cost
        • if the new node is in the graph
          • if the new cost is better than the cost
            stored, update the cost
        • else insert the node into the graph and
          store its cost
        • insert this node into the heap
  • end while
  • output the best node and corresponding path with
    the best cost
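
The search above can be sketched with Python's heapq module. This is an illustration under stated assumptions, not the presenter's code: it uses an additive path cost instead of the normalized cost (whose formula is not legible in this transcript), a hypothetical successor rule (next shift is a multiple of the current shift, next window length is larger, h >= s, and h is capped at 2N), and a purely illustrative toy_cost function.

```python
import heapq


def toy_cost(prev, nxt, p_alarm=0.1):
    """Hypothetical per-layer cost: amortized update plus expected search."""
    (h, s), (h2, s2) = prev, nxt
    children = -(-h2 // h)                # ceil(h2 / h), an assumed fan-out
    update = (children + 1) / s2          # reads + write, once per shift
    new_lengths = max(0, (h2 - s2) - (h - s))
    return update + 4 * p_alarm * new_lengths


def best_sat(N, layer_cost, M=3, h_cap=None):
    """Best-first search from (1,1) to a final state with h - s + 1 >= N."""
    h_cap = h_cap or 2 * N
    start = (1, 1)
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    finals = []
    while heap and len(finals) < M:
        cost, node = heapq.heappop(heap)
        if cost > best[node]:
            continue                      # stale heap entry
        h, s = node
        if h - s + 1 >= N:                # final state: covers length N
            finals.append((cost, node))
            continue
        for s2 in range(s, h_cap + 1, s):             # shift is a multiple
            for h2 in range(max(h + 1, s2), h_cap + 1):
                nxt = (h2, s2)
                new_cost = cost + layer_cost(node, nxt)
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost              # Dijkstra-style update
                    parent[nxt] = node
                    heapq.heappush(heap, (new_cost, nxt))
    if not finals:
        return None
    cost, node = min(finals)
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return cost, path[::-1]


print(best_sat(8, toy_cost))
```

Because edges are generated only when a node is popped, the graph stays small, and every generated path satisfies the successor rule by construction. With an additive non-negative cost the first final state popped is already the cheapest; stopping after M final states matters more for a non-additive cost such as the normalized one.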

33
Experiments
  • Test data: random, normally distributed N(6,1),
    1M
  • Sample data: 20K
  • Windows of interest are all the windows from
    length 1 up to the max length N
  • For window length h, the threshold T_h is set to
  • where bp is the real burst probability and Φ is
    the normal cumulative distribution function
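
The threshold formula itself is not legible in this transcript. One natural reading, assumed here and not confirmed by the slides, is that T_h is the value a sum of h independent N(6,1) points exceeds with probability bp, i.e., T_h = 6h + sqrt(h) * Φ^{-1}(1 - bp). A minimal sketch under that assumption, using the standard library's NormalDist (the name burst_threshold is hypothetical):

```python
from math import sqrt
from statistics import NormalDist


def burst_threshold(h, bp, mu=6.0, sigma=1.0):
    """Threshold exceeded with probability bp by a sum of h N(mu, sigma^2) points.

    Assumption: the points are independent, so the sum of h of them is
    distributed N(mu * h, sigma^2 * h).
    """
    return NormalDist(mu * h, sigma * sqrt(h)).inv_cdf(1.0 - bp)


for h in (1, 4, 16):
    print(h, burst_threshold(h, 0.0001))
```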

34
SAT w/ naïve detail search vs SBT
  • Total running time in machine clock
  • SAT_EP: SAT found using the experimental time cost
  • SAT_TH: SAT found using the theoretical time cost
  • SBT: Shifted Binary Tree

35
SAT w/ naïve detail search vs SBT (2)
36
Alarm probability is always large
  • In the testing data, even when the real burst
    probability Prob_true is 0.0001, the probability
    that a window of length l exceeds the threshold
    for length l-d increases rapidly even when d
    increases only a little.
  • The figure on the right shows this probability
    for all window length pairs less than 100 with
    Prob_true = 0.0001.
  • When Prob_true is considerably large, updating
    time decides the SAT structure, since the
    detection times are very close.

37
Discussion
  • SAT_EP is sensitive to CPU time, since it depends
    on timing tests on a small amount of sample
    data. Different runs may give different SAT
    structures.
  • It's stable in the sense that it always finds
    some SAT better than SBT.
  • SAT_TH doesn't give an accurate estimate of the
    actual running time.
  • But when Prob_alarm is large, it produces a
    better comparison of updating cost between
    different SATs, and thus a better result.

38
SAT w/ naïve detail search vs SBT (3)
  • Time breakdown in machine clock for SAT_EP