1
Mining Outliers in Large Datasets
  • Database Group Seminar
  • by Edward Hung
  • 2pm - 3:30pm, Sep 25 1998
  • CYC 313

2
Content
  • Introduction
  • Nested-Loop Algorithm (NL)
  • Cell-Based Approach
  • Memory-Resident (FindAllOutsM)
  • 2-D, Higher Dimensions
  • Disk-Resident (FindAllOutsD)
  • Comparisons
  • Future Works

3
Introduction
  • Algorithms for Mining Distance-Based Outliers in
    Large Datasets
  • Edwin M. Knorr and Raymond T. Ng
  • Definition
  • An object O in a dataset T is a DB(p,D)-outlier
    if at least fraction p of the objects in T lie
    at a distance greater than D from O
  • maximum number of objects within distance D of
    an outlier O: $M = N(1 - p)$, where N is the
    number of objects in T
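
The definition can be checked directly by brute force. The following is a minimal sketch, not the method of the talk; the function names and the sample points are made up for illustration. Counting the objects farther than D and comparing with pN is equivalent to checking that at most M objects (O included) lie within distance D.

```python
import math

def is_db_outlier(o, data, p, d):
    """True if at least fraction p of `data` lies farther than d from o."""
    far = sum(1 for q in data if math.dist(o, q) > d)
    return far >= p * len(data)

def find_outliers_brute_force(data, p, d):
    """Apply the definition to every object (quadratic in len(data))."""
    return [o for o in data if is_db_outlier(o, data, p, d)]

# Hypothetical data: seven clustered points and one isolated point.
sample_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
                 (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (5.0, 5.0)]
brute_force_outliers = find_outliers_brute_force(sample_points, p=0.75, d=1.0)
```

With N = 8 and p = 0.75, an outlier needs at least 6 objects farther than D away, which only the isolated point satisfies.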

4
Introduction (cont)
  • Applications
  • electronic commerce (low-value transactions
    expected)
  • detection of credit card fraud
  • monitoring of criminal activities
  • Related Works
  • distribution-based (need to know distribution)
  • clustering algorithms (not optimised)

5
Nested-Loop (NL) Algorithm

6
Nested-Loop (NL)
  • NL algorithm: block-oriented, nested-loop
  • total buffer size: B% of the dataset size
  • total buffer is split into 2 halves (first and
    second arrays)
  • dataset is read into the arrays; the distance
    between each pair of tuples is computed
  • for each tuple t in first array, a count of its
    D-neighbours is maintained

7
NL Algorithm
  • Algorithm
  • fill first array (size B/2% of dataset)
  • for each tuple t in the first array
  • count how many tuples in first array are close to
    t (distance $\le D$)
  • Repeat until all blocks are compared to first
    array
  • fill second array with another block
  • for each tuple t in the first array
  • increase the count by the number of tuples in
    second array close to t (distance $\le D$)
  • report tuples with count $\le M$ as outliers (cont)

8
NL Algorithm (cont)
  • if second array has served as first array before,
    stop; otherwise, swap the names of first and
    second arrays and repeat the above
  • complexity: $O(kN^2)$
  • blocks read: $n + (n-2)(n-1)$, for a dataset of n blocks
  • dataset passes: $(n + (n-2)(n-1))/n = n - 2 + 2/n$
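
The counting logic of NL can be sketched as follows. This is a simplified illustration under stated assumptions: the dataset is an in-memory list, "blocks" are slices of it, and the array-swapping read order that saves block reads is not reproduced, since it affects I/O only and not the result. The names `nested_loop_outliers`, `block_size` and the sample data are hypothetical.

```python
import math

def nested_loop_outliers(data, p, d, block_size):
    """Block-wise nested loop: count D-neighbours, report counts <= M."""
    m = len(data) * (1 - p)                       # M = N(1 - p)
    blocks = [data[i:i + block_size]
              for i in range(0, len(data), block_size)]
    outliers = []
    for first in blocks:                          # block in the "first array"
        counts = [0] * len(first)
        for second in blocks:                     # every block, including `first`
            for i, t in enumerate(first):
                if counts[i] > m:
                    continue                      # t is already a non-outlier
                for q in second:
                    if math.dist(t, q) <= d:
                        counts[i] += 1
        outliers.extend(t for t, c in zip(first, counts) if c <= m)
    return outliers

# Hypothetical data: seven clustered points and one isolated point.
nl_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
             (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (5.0, 5.0)]
nl_outliers = nested_loop_outliers(nl_points, p=0.75, d=1.0, block_size=3)
```

Skipping a tuple once its count exceeds M is the early exit that keeps the inner loop short for tuples in dense regions.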

9
NL Example
  • (Figure: a dataset of four blocks A, B, C, D and the pairs of blocks held in the buffer at each step)

10
Cell-Based Approach
  • Cell Structures and Properties
  • Memory-Resident (FindAllOutsM)
  • Disk-Resident (FindAllOutsD)

11
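Cell Structures
  • For 2-D, the space is partitioned into cells or
    squares of side length $l = D/(2\sqrt{2})$
  • Layer 1 (L1) neighbours of $C_{x,y}$: the 8 cells
    immediately adjacent to $C_{x,y}$
  • Layer 2 (L2) neighbours of $C_{x,y}$: the 40 cells
    within 3 cells of $C_{x,y}$ that are neither
    $C_{x,y}$ itself nor its L1 neighbours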
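
A minimal sketch of the cell structure, assuming (as in Knorr and Ng's paper) a cell side of $D/(2\sqrt{2})$, an L1 layer of the 8 adjacent cells, and an L2 layer of the 40 cells two or three cells away. The helper names are hypothetical.

```python
import math

def cell_side(d):
    """Side length chosen so that a cell's diagonal equals d / 2."""
    return d / (2 * math.sqrt(2))

def cell_of(point, d):
    """Integer (x, y) index of the cell containing a 2-D point."""
    side = cell_side(d)
    return (math.floor(point[0] / side), math.floor(point[1] / side))

def layer1(cell):
    """The 8 cells immediately adjacent to `cell`."""
    x, y = cell
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def layer2(cell):
    """The 40 cells that are 2 or 3 cells away from `cell`."""
    x, y = cell
    return [(x + dx, y + dy)
            for dx in range(-3, 4) for dy in range(-3, 4)
            if max(abs(dx), abs(dy)) >= 2]
```

For example, with D = 1 the point (1.0, 1.0) falls in cell (2, 2), because the side length is about 0.354.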

12
Cell Structures
  • (Figure: cell $C_{x,y}$ surrounded by its L1 and L2 neighbours)

13
Cell Properties
  • Property 1: Any pair of objects within the same
    cell is at most distance D/2 apart
  • (Figure: cell $C_{x,y}$ with the distance D/2 marked)

14
Cell Properties (cont)
  • Property 2: If $C_{u,v}$ is an L1 neighbour of $C_{x,y}$,
    then any object $P \in C_{u,v}$ and any object
    $Q \in C_{x,y}$ are at most distance D apart
  • (Figure: cell $C_{x,y}$ and $L_1(C_{x,y})$ with the distance D marked)

15
Cell Properties (cont)
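  • Property 3: If $C_{u,v} \ne C_{x,y}$ is neither an L1 nor
    an L2 neighbour of $C_{x,y}$, then any object
    $P \in C_{u,v}$ and any object $Q \in C_{x,y}$ must be
    more than distance D apart

  • (Figure: cell $C_{x,y}$ with $L_1(C_{x,y})$, $L_2(C_{x,y})$ and the cells beyond)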
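
Properties 1 to 3 follow from the cell geometry. A short derivation, assuming the cell side $l = D/(2\sqrt{2})$, an L1 layer one cell thick and an L2 layer two cells thick, as in Knorr and Ng's paper:

```latex
\begin{align*}
\text{Property 1 (same cell, one diagonal):}\quad
  & \sqrt{2}\,l = \sqrt{2}\cdot\frac{D}{2\sqrt{2}} = \frac{D}{2} \\
\text{Property 2 (cell and L1 neighbour, two diagonals):}\quad
  & 2\sqrt{2}\,l = D \\
\text{Property 3 (at least three whole cells in between):}\quad
  & 3l = \frac{3D}{2\sqrt{2}} \approx 1.06\,D > D
\end{align*}
```

The third line is why the L2 layer stops three cells out: anything farther away is separated by more than D along at least one axis.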

16
Cell Properties (cont)
  • Still remember the definition?
  • An object O in a dataset T is a DB(p,D)-outlier
    if at least fraction p of the objects in T lie
    at a distance greater than D from O
  • maximum number of objects within distance D of
    an outlier O: $M = N(1 - p)$

17
Cell Properties (cont)
  • Property 4
  • (a) If there are more than M objects in $C_{x,y}$,
    none of the objects in $C_{x,y}$ is an outlier
  • (b) If there are more than M objects in
    $C_{x,y} \cup L_1(C_{x,y})$, none of the objects
    in $C_{x,y}$ is an outlier
  • (c) If there are at most M objects in
    $C_{x,y} \cup L_1(C_{x,y}) \cup L_2(C_{x,y})$,
    every object in $C_{x,y}$ is an outlier

18
FindAllOutsM Algorithm (Memory-resident Dataset)
  • initialise m counts (m is the number of cells) ($O(m)$)
  • for each object P, map P to an appropriate cell
    $C_q$, store P, and increment $Count_q$ by 1 ($O(N)$)
  • for each cell, if its count is $> M$, label the cell
    red ($O(m)$) (a red cell has at least $M+1$ objects,
    so there are at most $N/(M+1)$ red cells)
  • for each red cell, label each of its L1 neighbours
    pink if it is not red ($O(N/(M+1))$) (cont)

19
FindAllOutsM Algorithm (cont)
  • for each non-empty white cell $C_w$, do
  • if the count of $C_w \cup L_1(C_w)$ is $> M$, label
    $C_w$ pink; otherwise
  • if the count of $C_w \cup L_1(C_w) \cup L_2(C_w)$ is
    $\le M$, mark all objects in $C_w$ as outliers
  • else for each object P in $C_w$, do
  • if the number of objects close to P (distance
    $\le D$) is $\le M$, mark P as an outlier
  • (worst case: m cells, each with M objects, gives
    $O(mM^2)$, but $M = N(1-p)$ and p is expected to be
    extremely close to 1, so $O(mN^2(1-p)^2)$ is
    approximated by $O(m)$)
  • Complexity is $O(m + N)$
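
The steps above can be sketched for the 2-D case as follows. This is an illustrative sketch, not the authors' code: it assumes a cell side of $D/(2\sqrt{2})$, an L1 layer of adjacent cells and an L2 layer two or three cells away, and it stores cells in a dictionary rather than a fixed array. The name `find_all_outs_m` and the helper `ring` are hypothetical.

```python
import math
from collections import defaultdict

def find_all_outs_m(points, p, d):
    """Cell-based search for DB(p, d)-outliers in 2-D (memory-resident)."""
    m = len(points) * (1 - p)                     # M = N(1 - p)
    side = d / (2 * math.sqrt(2))                 # assumed cell side length

    cells = defaultdict(list)                     # cell index -> its objects
    for pt in points:
        key = (math.floor(pt[0] / side), math.floor(pt[1] / side))
        cells[key].append(pt)

    def ring(cell, lo, hi):
        """Non-empty cells between lo and hi cells away from `cell`."""
        x, y = cell
        return [(x + dx, y + dy)
                for dx in range(-hi, hi + 1) for dy in range(-hi, hi + 1)
                if lo <= max(abs(dx), abs(dy)) <= hi
                and (x + dx, y + dy) in cells]

    red = {c for c, objs in cells.items() if len(objs) > m}
    pink = {n for c in red for n in ring(c, 1, 1)} - red

    outliers = []
    for c, objs in cells.items():
        if c in red or c in pink:
            continue                              # Property 4(a)/(b)
        l2 = ring(c, 2, 3)
        count_l1 = len(objs) + sum(len(cells[n]) for n in ring(c, 1, 1))
        if count_l1 > m:
            continue                              # cell becomes pink
        if count_l1 + sum(len(cells[n]) for n in l2) <= m:
            outliers.extend(objs)                 # Property 4(c)
            continue
        for pt in objs:                           # object-by-object check
            near = count_l1 + sum(1 for n in l2 for q in cells[n]
                                  if math.dist(pt, q) <= d)
            if near <= m:
                outliers.append(pt)
    return outliers
```

Only objects in the L2 layer need a distance computation: by Properties 1 and 2, everything in the cell and its L1 layer is already known to be within D.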

20
FindAllOutsM for Higher Dimensions
  • for k dimensions, the space is partitioned into
    general k-D cells of side length $l = D/(2\sqrt{k})$

21
FindAllOutsM for Higher Dimensions (cont)
  • All properties 1 to 4 are preserved
  • m (number of cells) is exponential w.r.t. k
    (number of dimensions)
  • complexity of last step in algorithm changes from
    O(m) to
  • total complexity: $O(c^k + N)$
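
To make the exponential growth of m concrete, here is a small sketch that counts the cells needed to cover a unit hypercube. It assumes a k-D cell side of $D/(2\sqrt{k})$ (so that a cell's diagonal stays at D/2); the function names and the unit-cube setting are hypothetical.

```python
import math

def cell_side_k(d, k):
    """Side length that keeps the diagonal of a k-D cell equal to d / 2."""
    return d / (2 * math.sqrt(k))

def cells_in_unit_cube(d, k):
    """Number of cells needed to cover [0, 1]^k."""
    per_axis = math.ceil(1.0 / cell_side_k(d, k))
    return per_axis ** k

# Cell counts for D = 0.5 as the number of dimensions grows.
growth = [cells_in_unit_cube(0.5, k) for k in range(1, 6)]
```

Both effects compound: each extra dimension multiplies the cell count by the cells per axis, and the cells per axis also rise because the side length shrinks with $\sqrt{k}$.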

22
FindAllOutsD
  • Dataset is Disk-Resident
  • Goal is to minimize the number of page reads
  • 2 phases where page reads are needed
  • initial mapping of each object to a cell
  • pairwise (object-by-object) distance
    calculations

23
FindAllOutsD (cont)
  • all pages are classified into three categories
  • A. Pages that contain some white tuples
  • B. Pages that do not contain any white tuples,
    but contain tuples mapped to a non-white cell
    which is an L2 neighbour of some white cell
  • C. All other pages
  • To minimize page reads, read Class A pages, then
    Class B pages, then re-read Class A pages
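
The page classification and read order can be sketched as follows. The data model is an assumption made for illustration: each page is a list of cell indices (one per tuple stored on it), `white_cells` is the set of white cells, and the L2 layer is taken to be the cells two or three cells away. The names `classify_pages` and `read_order` are hypothetical.

```python
def layer2(cell):
    """Cells 2 or 3 cells away from `cell` (assumed L2 layer, 2-D)."""
    x, y = cell
    return {(x + dx, y + dy)
            for dx in range(-3, 4) for dy in range(-3, 4)
            if max(abs(dx), abs(dy)) >= 2}

def classify_pages(pages, white_cells):
    """Label each page 'A', 'B' or 'C' following the three categories."""
    near_white = set()
    for c in white_cells:
        near_white |= layer2(c)
    near_white -= white_cells            # non-white L2 neighbours of white cells
    classes = []
    for page in pages:
        if any(c in white_cells for c in page):
            classes.append("A")          # contains some white tuples
        elif any(c in near_white for c in page):
            classes.append("B")          # no white tuples, but near a white cell
        else:
            classes.append("C")          # never needed for distance checks
    return classes

def read_order(classes):
    """Page indices in the order: class A, class B, class A again."""
    a = [i for i, c in enumerate(classes) if c == "A"]
    b = [i for i, c in enumerate(classes) if c == "B"]
    return a + b + a
```

Class C pages never appear in the read order, which is where the saving over a full scan comes from.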

24
FindAllOutsD (cont)
  • Suppose tuple P is mapped to white cell $C_w$ and is
    stored in (class A) page i. To complete its
    object-by-object distance calculations, the
    following 3 kinds of tuples are needed:
  • white tuples Q mapped to a white L2 neighbour of
    $C_w$: the pair (P, Q) is kept in main memory
  • non-white tuples Q mapped to a non-white L2
    neighbour of $C_w$, but stored in page $j > i$: P is
    already in main memory when Q is read

25
FindAllOutsD (cont)
  • non-white tuples Q mapped to a non-white L2
    neighbour of $C_w$, but stored in page $j < i$: when
    Q is read, P has not been read yet. Q is not kept
    since it is non-white, so when P is read, Q is
    gone. Thus page j needs to be re-read.
  • Algorithm FindAllOutsD requires at most 3 passes
    over the dataset

26
Comparisons

27
Comparisons
  • NL
  • n blocks
  • pages read: $n + (n-2)(n-1)$
  • dataset passes: $(n + (n-2)(n-1))/n = (n^2 - 2n + 2)/n = n - 2 + 2/n > n - 2$
  • CS - cell structure
  • FindAllOutsM
  • FindAllOutsD
  • at most 3 dataset passes
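
A quick worked check of the NL figures, using the block-read formula above. The function names are made up for illustration.

```python
def nl_block_reads(n):
    """Block reads needed by NL for a dataset of n blocks."""
    return n + (n - 2) * (n - 1)

def nl_dataset_passes(n):
    """The same cost expressed in passes over the whole dataset."""
    return nl_block_reads(n) / n

reads_4 = nl_block_reads(4)        # 4 + 2 * 3 = 10 block reads
passes_4 = nl_dataset_passes(4)    # 10 / 4 = 2.5 passes
passes_20 = nl_dataset_passes(20)  # grows roughly like n - 2
```

With four blocks NL needs 2.5 passes, slightly under the cell-based bound of 3, but the NL cost keeps growing with n while the cell-based bound stays fixed.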

28
Comparisons (cont)
  • CPU + I/O time
  • in varying dataset size
  • NL: quadratic
  • CS: linear
  • 0.1 thousand tuples: NL 27.67s, FindAllOutsM
    1.43s
  • 2 million tuples: NL 2332.10s, FindAllOutsD
    256s

29
Comparisons (cont)
  • in varying the number of dimensions (k) and cells
  • CS outperforms NL for 3-D and 4-D; for 5-D and
    beyond, NL wins clearly due to the exponential
    growth of cells in CS
  • since the same amount of memory is given to both
    CS and NL, as k increases the number of cells
    increases, so the amount of buffer space
    available to NL increases

30
Comparisons (cont)
  • For CS, when varying the value of p
  • as p increases, the number of red and pink cells
    (with non-outliers) increases, so the processing
    time is less

31
Conclusion
  • cell-based algorithms ($O(c^k + N)$) for $k < 5$
  • nested-loop algorithm ($O(kN^2)$) for $k > 4$
  • finding all DB-outliers is computationally very
    feasible for large, multidimensional datasets
  • using NL, there is no practical limit on the size
    of the dataset or on the number of dimensions

32
Future Works

33
Future Works
  • modify the page reading order in NL to reduce the
    number of page reads
  • modify NL algorithm to run on parallel machines

34
Questions and Answers
  • Thank you very much