Mining Outliers in Large Datasets - PowerPoint PPT Presentation

Provided by: hunge2

1
Mining Outliers in Large Datasets

Database Group Seminar
by Edward Hung
2pm - 3:30pm, Sep 25 1998
CYC 313

2
Content

Introduction
Nested-Loop Algorithm (NL)
Cell-Based Approach
Memory-Resident (FindAllOutsM)
2-D, Higher Dimensions
Disk-Resident (FindAllOutsD)
Comparisons
Future Works

3
Introduction

Algorithms for Mining Distance-Based Outliers in
Large Datasets
Edwin M. Knorr and Raymond T. Ng
Definition
An object O in a dataset T is a DB(p, D)-outlier
if at least a fraction p of the objects in T lie
at a distance greater than D from O
M = N(1 - p) is the maximum number of objects within
distance D of an outlier O, where N is the number of objects in T
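
The definition above can be checked directly by brute force. The following is a minimal sketch, not the method from the paper: the function name `db_outliers` and the sample points are made up for illustration, and Euclidean distance is assumed.

```python
import math

def db_outliers(points, p, D):
    """Return the points that are DB(p, D)-outliers, by brute force."""
    N = len(points)
    outliers = []
    for o in points:
        # Count the objects of T that lie at a distance greater than D from o.
        far = sum(1 for q in points if math.dist(o, q) > D)
        # o is an outlier if at least a fraction p of the N objects are far away.
        if far >= p * N:
            outliers.append(o)
    return outliers

# Four points packed together and one isolated point (illustrative data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = db_outliers(points, p=0.75, D=1.0)
```

Testing `far >= p * N` is the same condition as "at most M = N(1 - p) objects within distance D", with O itself counted among the objects within distance D.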

4
Introduction (cont)

Applications
electronic commerce (low-value transactions
expected)
detection of credit card fraud
monitoring of criminal activities
Related Works
distribution-based (need to know distribution)
clustering algorithms (not optimised)

5
Nested-Loop (NL) Algorithm
6
Nested-Loop (NL)

NL algorithm: block-oriented, nested-loop design
total buffer size = B% of the dataset size
the buffer is divided into 2 halves (first and second
arrays)
the dataset is read into the arrays, and the distance
between each pair of tuples is computed
for each tuple t in the first array, a count of its
D-neighbours is maintained

7
NL Algorithm

Algorithm
fill the first array with a block (B/2% of the dataset)
for each tuple t in the first array
count how many tuples in the first array are close to
t (distance <= D)
repeat until all blocks have been compared to the first
array
fill the second array with another block
for each tuple t in the first array
increase the count by the number of tuples in the
second array close to t (distance <= D)
report tuples with count <= M as outliers (cont)

8
NL Algorithm (cont)

if the second array has served as the first array before,
stop; otherwise, swap the names of the first and
second arrays and repeat the above
complexity: O(kN^2)
blocks read: n + (n-2)(n-1), where n is the number of blocks
dataset passes: (n + (n-2)(n-1))/n = n - 2 + 2/n
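
The counting logic of NL can be sketched in a few lines. This is a simplified illustration only: the blocks are plain in-memory lists, the buffer swapping and disk reads are not modelled, and the name `nested_loop_outliers` is hypothetical. Euclidean distance is assumed.

```python
import math

def nested_loop_outliers(blocks, M, D):
    """Block-oriented nested loop over a list of blocks (lists of tuples)."""
    outliers = []
    for first in blocks:                  # block playing the role of the first array
        counts = [0] * len(first)         # D-neighbour count for each tuple in it
        for second in blocks:             # every block, including `first` itself
            for a, t in enumerate(first):
                if counts[a] > M:
                    continue              # t is already known to be a non-outlier
                for s in second:
                    if math.dist(t, s) <= D:
                        counts[a] += 1
        # Tuples whose count never exceeded M are reported as outliers.
        outliers.extend(t for a, t in enumerate(first) if counts[a] <= M)
    return outliers

blocks = [[(0.0, 0.0), (0.1, 0.0)], [(0.0, 0.1), (5.0, 5.0)]]
found = nested_loop_outliers(blocks, M=1, D=1.0)
```

Each tuple is compared against every other tuple in the worst case, which is where the N^2 factor in the complexity comes from.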

9
NL Example
[Figure: NL example. The dataset consists of four blocks, A, B, C and D; the figure steps through the pairs of blocks held in the two-array buffer.]
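
The block-read count n + (n-2)(n-1) can be reproduced with a small simulation of the two-array buffer. This is a sketch under an assumed read order (blocks that have already served as the first array are read first, so that the block left in the second array is, where possible, one that has not served yet); the order shown on the slide may differ. The function name `simulate_block_reads` is hypothetical.

```python
def simulate_block_reads(n):
    """Count block reads for n blocks with a two-array buffer."""
    if n == 0:
        return 0
    served = set()           # blocks that have already been the first array
    first, other = 0, None   # block ids currently held in the two arrays
    reads = 1                # initial read of block 0 into the first array
    while True:
        served.add(first)
        # Blocks that still have to be brought in and compared with `first`.
        remaining = [b for b in range(n) if b not in (first, other)]
        # Assumed order: already-served blocks first, unserved blocks last.
        remaining.sort(key=lambda b: b not in served)
        for b in remaining:
            other = b        # overwrite the second array with block b
            reads += 1
        if other is None or other in served:
            return reads     # the second array has served as first before: stop
        first, other = other, first  # swap the names of the two arrays

reads_for_four_blocks = simulate_block_reads(4)
```

For the four-block dataset A, B, C, D this gives 4 reads while A is the first array and 2 reads for each of the other three blocks.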
10
Cell-Based Approach

Cell Structures and Properties
Memory-Resident (FindAllOutsM)
Disk-Resident (FindAllOutsD)

11
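Cell Structures

For 2-D, the space is partitioned into cells, or
squares, of side length l = D / (2 sqrt(2))
Layer 1 (L1) neighbours of Cx,y: the cells immediately surrounding Cx,y, i.e. Cu,v with |u - x| <= 1 and |v - y| <= 1, excluding Cx,y itself
Layer 2 (L2) neighbours of Cx,y: the cells Cu,v with |u - x| <= 3 and |v - y| <= 3 that are neither Cx,y nor L1 neighbours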
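
A sketch of the 2-D cell structure follows. It rests on assumptions that should be checked against the paper: a cell side of D / (2 sqrt(2)) (the side for which a cell's diagonal is D/2, matching Property 1), L1 as the 8 surrounding cells, and L2 as the next two rings of cells. The function names are hypothetical.

```python
import math

def cell_of(point, D):
    """Map a 2-D point to the index (x, y) of the cell that contains it."""
    side = D / (2 * math.sqrt(2))   # assumed cell side length
    return (math.floor(point[0] / side), math.floor(point[1] / side))

def layer1(cell):
    """The 8 cells immediately surrounding `cell`."""
    x, y = cell
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def layer2(cell):
    """Cells within 3 cells of `cell` that are neither `cell` nor in layer 1."""
    x, y = cell
    return [(x + dx, y + dy)
            for dx in range(-3, 4) for dy in range(-3, 4)
            if max(abs(dx), abs(dy)) >= 2]

home = cell_of((1.0, 0.5), D=1.0)
```

Under these assumptions every cell has 8 L1 neighbours and 7 * 7 - 3 * 3 = 40 L2 neighbours.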

12
Cell Structures
[Figure: cell Cx,y surrounded by its L1(Cx,y) and L2(Cx,y) neighbours]
13
Cell Properties

Property 1: Any pair of objects within the same
cell is at most distance D/2 apart

[Figure: cell Cx,y, annotated with the distance D/2]
14
Cell Properties (cont)

Property 2: If Cu,v is an L1 neighbour of Cx,y,
then any object P in Cu,v and any object Q in
Cx,y are at most distance D apart

[Figure: cell Cx,y and L1(Cx,y), annotated with the distance D]
15
Cell Properties (cont)

Property 3: If Cu,v ≠ Cx,y is neither an L1 nor
an L2 neighbour of Cx,y, then any object P in
Cu,v and any object Q in Cx,y must be more than distance
D apart

[Figure: cell Cx,y with L1(Cx,y) and L2(Cx,y)]
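
Properties 1 to 3 follow from the cell geometry, and the arithmetic can be checked numerically. The sketch below assumes a 2-D cell side of D / (2 sqrt(2)) and an L2 layer two cells thick; both are assumptions made for the illustration, and the variable names are hypothetical.

```python
import math

D = 1.0
side = D / (2 * math.sqrt(2))           # assumed 2-D cell side length

# Property 1: the farthest two objects in one cell can be is the cell diagonal.
same_cell_max = side * math.sqrt(2)

# Property 2: a cell and an L1 neighbour fit inside a 2x2 block of cells,
# so their objects are at most the diagonal of that block apart.
l1_max = 2 * side * math.sqrt(2)

# Property 3: a cell outside L1 and L2 is separated from Cx,y by at least
# three whole cells (one L1 cell plus two L2 cells) along some axis.
beyond_l2_min = 3 * side
```

The three values come out as D/2, D and roughly 1.06 D, which is why objects beyond L2 never need a distance computation.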

16
Cell Properties (cont)

Recall the definition:
An object O in a dataset T is a DB(p, D)-outlier
if at least a fraction p of the objects in T lie
at a distance greater than D from O
M = N(1 - p) is the maximum number of objects within
distance D of an outlier O

17
Cell Properties (cont)

Property 4
(a) If there are > M objects in Cx,y, none of the
objects in Cx,y is an outlier
(b) If there are > M objects in Cx,y ∪ L1(Cx,y),
none of the objects in Cx,y is an outlier
(c) If there are <= M objects in Cx,y ∪ L1(Cx,y)
∪ L2(Cx,y), every object in Cx,y is an outlier
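
Property 4 amounts to a three-way test on object counts. A minimal sketch, with a hypothetical function name and return labels, assuming the thresholds are "more than M" for (a) and (b) and "at most M" for (c):

```python
def classify_cell(count_cell, count_l1, count_l2, M):
    """Apply Property 4 to one cell, given object counts for the cell,
    its L1 neighbours and its L2 neighbours."""
    if count_cell > M:
        return "no outliers (a)"
    if count_cell + count_l1 > M:
        return "no outliers (b)"
    if count_cell + count_l1 + count_l2 <= M:
        return "all outliers (c)"
    # Property 4 settles nothing here: objects must be checked one by one.
    return "undecided"

verdict = classify_cell(count_cell=2, count_l1=1, count_l2=0, M=5)
```

Only cells that come back undecided require object-by-object distance calculations.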

18
FindAllOutsM Algorithm (Memory-resident Dataset)

initialise m counts (m is the number of cells) (O(m))
for each object P, map P to an appropriate cell
Cq, store P, and increment Countq by 1 (O(N))
for each cell, if Count > M, label the cell red
(O(m)) (a red cell has at least M+1 objects, so
there are at most N/(M+1) red cells)
for each red cell, label each of its L1 neighbours
pink, if it is not red (O(N/(M+1))) (cont)

19
FindAllOutsM Algorithm (cont)

for each non-empty white cell Cw, do
if the Count of Cw ∪ L1(Cw) > M, label Cw pink; otherwise,
if the Count of Cw ∪ L1(Cw) ∪ L2(Cw) <= M, mark all objects in
Cw as outliers;
else, for each object P in Cw, do
if the number of objects close to P (distance <= D)
is <= M, mark P as an outlier
(worst case: m cells, each with M objects, gives O(mM^2);
but M = N(1-p) and p is expected to be extremely
close to 1, so O(mN^2(1-p)^2) is approximated by O(m))
Complexity is O(m + N)
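
The steps of FindAllOutsM can be sketched for the 2-D case as follows. This is an illustration under stated assumptions rather than the authors' implementation: cells are stored in a dictionary instead of a fixed array of m counts, the cell side is taken to be D / (2 sqrt(2)), L1 is one ring of cells and L2 the next two rings, and the names `find_all_outs_m`, `ring` and `count` are hypothetical.

```python
import math
from collections import defaultdict

def find_all_outs_m(points, p, D):
    """Cell-based sketch for 2-D points; returns the DB(p, D)-outliers."""
    M = len(points) * (1 - p)
    side = D / (2 * math.sqrt(2))               # assumed cell side length
    cells = defaultdict(list)
    for pt in points:                           # map each object to its cell
        cells[(math.floor(pt[0] / side), math.floor(pt[1] / side))].append(pt)

    def ring(cell, lo, hi):
        """Cells whose index offset from `cell` is between lo and hi rings."""
        x, y = cell
        return [(x + dx, y + dy)
                for dx in range(-hi, hi + 1) for dy in range(-hi, hi + 1)
                if lo <= max(abs(dx), abs(dy)) <= hi]

    def count(cell_ids):
        return sum(len(cells[c]) for c in cell_ids if c in cells)

    red = {c for c, objs in cells.items() if len(objs) > M}
    pink = {n for c in red for n in ring(c, 1, 1)
            if n in cells and n not in red}

    outliers = []
    for c, objs in cells.items():
        if c in red or c in pink:
            continue                            # no outliers in red or pink cells
        count_l1 = len(objs) + count(ring(c, 1, 1))
        if count_l1 > M:
            continue                            # this cell would be labelled pink
        l2 = [n for n in ring(c, 2, 3) if n in cells]
        if count_l1 + count(l2) <= M:
            outliers.extend(objs)               # every object in the cell is an outlier
            continue
        for pt in objs:                         # object-by-object check against L2 only
            near = count_l1 + sum(1 for n in l2 for q in cells[n]
                                  if math.dist(pt, q) <= D)
            if near <= M:
                outliers.append(pt)
    return outliers

sample = [(0.01 * i, 0.01 * i) for i in range(7)] + [(5.0, 5.0)]
found = find_all_outs_m(sample, p=0.75, D=1.0)
```

Distances are computed only between objects in undecided white cells and objects in their L2 neighbours; everything else is settled by the counts.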

20
FindAllOutsM for Higher Dimensions

for k dimensions, the space is partitioned into
general k-D cells of side length l = D / (2 sqrt(k))

21
FindAllOutsM for Higher Dimensions (cont)

All properties 1 to 4 are preserved
m (number of cells) is exponential w.r.t. k
(number of dimensions)
complexity of the last step in the algorithm changes from
O(m) to O(m (2 ceil(2 sqrt(k)) + 1)^k)
total complexity: O(c^k + N)
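
The exponential growth of m with k can be illustrated with a short calculation. The sketch assumes a cell side of D / (2 sqrt(k)) (the side for which a k-D cell's diagonal is D/2) and a data space that is a hypercube; the function names and the `extent` parameter are hypothetical.

```python
import math

def cell_side(D, k):
    """Assumed side length of a k-D cell whose diagonal is D/2."""
    return D / (2 * math.sqrt(k))

def cells_needed(extent, D, k):
    """Number of cells covering a k-D hypercube with edge length `extent`."""
    per_axis = math.ceil(extent / cell_side(D, k))
    return per_axis ** k

# Cell counts for a unit hypercube with D = 0.5, as k grows.
growth = {k: cells_needed(1.0, 0.5, k) for k in (2, 3, 4)}
```

The count grows both because the exponent k increases and because the cells shrink as sqrt(k) grows.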

22
FindAllOutsD

Dataset is Disk-Resident
Goal is to minimize the number of page reads
2 phases where page reads are needed
initial mapping of each object to a cell
object-by-object (pairwise) distance
calculation

23
FindAllOutsD (cont)

all pages are classified into three categories
A. Pages that contain some white tuples
B. Pages that do not contain any white tuples,
but contain tuples mapped to a non-white cell
which is an L2 neighbour of some white cell
C. All other pages
To minimize page reads, read Class A pages, then
Class B pages, then re-read Class A pages
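
The page classification can be sketched as below. The data layout is an assumption made for the example: each page is a list holding the cell id of every tuple stored on it, and `l2_neighbours` is any function returning the L2 neighbours of a cell. The function name `classify_pages` and the one-dimensional toy cell ids are hypothetical.

```python
def classify_pages(pages, white_cells, l2_neighbours):
    """Label each page "A", "B" or "C".

    pages: list of pages, each a list of cell ids (one per tuple on the page)
    white_cells: set of ids of the cells that are still white
    l2_neighbours: function mapping a cell id to the ids of its L2 neighbours
    """
    # Non-white cells that are an L2 neighbour of some white cell.
    near_white = set()
    for w in white_cells:
        near_white.update(c for c in l2_neighbours(w) if c not in white_cells)

    classes = []
    for page in pages:
        if any(c in white_cells for c in page):
            classes.append("A")        # contains some white tuples
        elif any(c in near_white for c in page):
            classes.append("B")        # no white tuples, but L2-relevant ones
        else:
            classes.append("C")        # never needed for distance calculations
    return classes

# Toy setup with 1-D cell ids, where L2 means two or three cells away.
toy_l2 = lambda c: [c - 3, c - 2, c + 2, c + 3]
labels = classify_pages([[10, 0], [12, 0], [0, 1], [13]], {10}, toy_l2)
```

Class C pages are skipped entirely after the initial mapping, which is where the I/O savings come from.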

24
FindAllOutsD (cont)

Suppose tuple P is mapped to white cell Cw and is
stored in (class A) page i. To complete its
object-by-object distance calculations, the
following 3 kinds of tuples are needed:
white tuples Q mapped to a white L2 neighbour of
Cw: the pair (P, Q) is kept in main memory
non-white tuples Q mapped to a non-white L2
neighbour of Cw, but stored in page j >= i: P is
already in main memory when Q is read

25
FindAllOutsD (cont)

non-white tuples Q mapped to a non-white L2
neighbour of Cw, but stored in page j < i: when Q
is read, P has not been read yet. Q is not kept
since it is non-white, so when P is read, Q is
gone. Thus page j needs to be re-read.
Algorithm FindAllOutsD requires at most 3 passes
over the dataset
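
The bound of 3 passes can be seen by adding up the page reads. The sketch below assumes that the initial mapping reads every page exactly once and that the distance phase reads class A pages, then class B pages, then class A pages again; the function name `page_reads` is hypothetical.

```python
def page_reads(classes):
    """Total page reads for a list of page classes ("A", "B" or "C")."""
    n_a = classes.count("A")
    n_b = classes.count("B")
    mapping_pass = len(classes)   # assumed: one full pass to map tuples to cells
    # Distance phase: class A pages, then class B pages, then class A again.
    return mapping_pass + n_a + n_b + n_a

example_reads = page_reads(["A", "B", "C", "A"])
```

Since the class A and class B pages together make up at most the whole dataset, and the class A pages alone do too, the total never exceeds three times the number of pages.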

26
Comparisons
27
Comparisons

NL
n blocks
block reads: n + (n-2)(n-1)
dataset passes: (n + (n-2)(n-1))/n
= (n^2 - 2n + 2)/n = n - 2 + 2/n > n - 2
CS - cell structure
FindAllOutsM
FindAllOutsD
at most 3 dataset passes

28
Comparisons (cont)

CPU + I/O time
as the dataset size varies
NL: quadratic growth, in line with O(kN^2)
CS: linear growth
0.1 thousand tuples: NL 27.67s, FindAllOutsM 1.43s
2 million tuples: NL 2332.10s, FindAllOutsD 256s

29
Comparisons (cont)

as the number of dimensions (k) and cells varies
CS outperforms NL for 3-D and 4-D; for 5-D and beyond, NL
wins clearly due to the exponential growth of
cells in CS
since the same amount of memory is given to both
CS and NL, as k increases the number of cells
increases, so the amount of buffer space
available to NL increases

30
Comparisons (cont)

For CS, as the value of p varies
as p increases, the number of red and pink cells
(with non-outliers) increases, so the processing
time is less

31
Conclusion

cell-based algorithms (O(c^k + N)) for k < 5
nested-loop algorithm (O(kN^2)) for k > 4
finding all DB-outliers is computationally very
feasible for large, multidimensional datasets
using NL, there is no practical limit on the size
of the dataset or on the number of dimensions

32
Future Works
33
Future Works