FINAL LECTURE - PowerPoint PPT Presentation

1 / 21
About This Presentation
Title:

FINAL LECTURE

Description:

FINAL LECTURE Prof. Navneet Goyal CSIS Department, BITS-Pilani * * * * * * * * * * * * Topics Grid-based Clustering Anomaly/Outlier Analysis Research Trends in Data ... – PowerPoint PPT presentation

Number of Views:50
Avg rating:3.0/5.0
Slides: 22
Provided by: csisBits6
Category:

less

Transcript and Presenter's Notes

Title: FINAL LECTURE


1
FINAL LECTURE
  • Prof. Navneet Goyal
  • CSIS Department, BITS-Pilani

2
Topics
  • Grid-based Clustering
  • Anomaly/Outlier Analysis
  • Research Trends in Data Mining
  • Stream Data Mining
  • Unstructured Data Mining
  • Multi-relational Data Mining
  • Applications
  • Sciences
  • Social
  • WWW

3
FINAL LECTURE
  • This is not the end.
  • It is not even the beginning of the end.
  • But it is, perhaps, the end of the beginning!
  • - Winston Churchill

4
Grid-based Clustering
  • DBSCAN simple but effective algorithm for
    finding density-based clusters, i.e., dense
    regions of objects that are surrounded by
    low-density regions
  • We now look at additional density-based
    clustering techniques that address issues of
  • Efficiency (GRID METHODS STING)
  • Finding clusters in subspaces (CLIQUE)
  • More accurately modeling density (DENCLUE)

5
Grid-based Clustering
  • Using multi-resolution grid data structure
  • Several interesting methods
  • STING (a STatistical INformation Grid approach)
    by Wang, Yang and Muntz (1997)
  • WaveCluster by Sheikholeslami, Chatterjee, and
    Zhang (VLDB98)
  • A multi-resolution clustering approach using
    wavelet method
  • CLIQUE Agrawal, et al. (SIGMOD98)

6
STING
  • STatistical INformation Grid approach) by Wang,
    Yang and Muntz (VLDB97)
  • The spatial area area is divided into rectangular
    cells
  • There are several levels of cells corresponding
    to different levels of resolution

7
STING
  • Each cell at a high level is partitioned into a
    number of smaller cells in the next lower level
  • Statistical info of each cell is calculated and
    stored beforehand and is used to answer queries
  • Parameters of higher level cells can be easily
    calculated from parameters of lower level cell
  • count, mean, s, min, max
  • type of distributionnormal, uniform, etc.
  • Use a top-down approach to answer spatial data
    queries
  • Start from a pre-selected layertypically with a
    small number of cells
  • For each cell in the current level compute the
    confidence interval

8
STING
  • Remove the irrelevant cells from further
    consideration
  • When finish examining the current layer, proceed
    to the next lower level
  • Repeat this process until the bottom layer is
    reached
  • Advantages
  • Query-independent, easy to parallelize,
    incremental update
  • O(K), where K is the number of grid cells at the
    lowest level
  • Disadvantages
  • All the cluster boundaries are either horizontal
    or vertical, and no diagonal boundary is detected

9
CLIQUE (Clustering In QUEst)
  • Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD98).
  • Automatically identifying subspaces of a high
    dimensional data space that allow better
    clustering than original space
  • CLIQUE can be considered as both density-based
    and grid-based
  • It partitions each dimension into the same number
    of equal length interval
  • It partitions an m-dimensional data space into
    non-overlapping rectangular units
  • A unit is dense if the fraction of total data
    points contained in the unit exceeds the input
    model parameter
  • A cluster is a maximal set of connected dense
    units within a subspace

10
CLIQUE The Major Steps
  • Partition the data space and find the number of
    points that lie inside each cell of the
    partition.
  • Identify the subspaces that contain clusters
    using the Apriori principle
  • Identify clusters
  • Determine dense units in all subspaces of
    interests
  • Determine connected dense units in all subspaces
    of interests.
  • Generate minimal description for the clusters
  • Determine maximal regions that cover a cluster of
    connected dense units for each cluster
  • Determination of minimal cover for each cluster

11
Salary (10,000)
7
6
5
4
3
2
1
age
0
20
30
40
50
60
? 3
12
Anomaly/Outlier Analysis
  • What are outliers?
  • Outlier (noun) something that is situated away
    from or classed differently from a main or
    related body
  • A statistical observation that is markedly
    different in value from the others of the sample
  • The set of objects are considerably dissimilar
    from the remainder of the data
  • Example Sports Michael Jordon, Sachin
    Tendulkar, Tiger Woods

13
Anomaly/Outlier Analysis
  • Applications
  • Credit card fraud detection
  • Telecom fraud detection
  • IDS
  • Terrorism Prevention

14
Anomaly/Outlier Detection
  • What are anomalies/outliers?
  • The set of data points that are considerably
    different than the remainder of the data
  • Variants of Anomaly/Outlier Detection Problems
  • Given a database D, find all the data points x ?
    D with anomaly scores greater than some threshold
    t
  • Given a database D, find all the data points x ?
    D having the top-n largest anomaly scores f(x)
  • Given a database D, containing mostly normal (but
    unlabeled) data points, and a test point x,
    compute the anomaly score of x with respect to D

15
Importance of Anomaly Detection
  • Ozone Depletion History
  • In 1985 three researchers (Farman, Gardinar and
    Shanklin) were puzzled by data gathered by the
    British Antarctic Survey showing that ozone
    levels for Antarctica had dropped 10 below
    normal levels
  • Why did the Nimbus 7 satellite, which had
    instruments aboard for recording ozone levels,
    not record similarly low ozone concentrations?
  • The ozone concentrations recorded by the
    satellite were so low they were being treated as
    outliers by a computer program and discarded!

Sources http//exploringdata.cqu.edu.au/ozon
e.html http//www.epa.gov/ozone/science/hole
/size.html
16
Anomaly Detection
  • Challenges
  • How many outliers are there in the data?
  • Method is unsupervised
  • Validation can be quite challenging (just like
    for clustering)
  • Finding needle in a haystack
  • Working assumption
  • There are considerably more normal observations
    than abnormal observations (outliers/anomalies)
    in the data

17
Outlier Discovery Statistical Approaches
  • Assume a model underlying distribution that
    generates data set (e.g. normal distribution)
  • Use discordancy tests depending on
  • data distribution
  • distribution parameter (e.g., mean, variance)
  • number of expected outliers
  • Drawbacks
  • most tests are for single attribute
  • In many cases, data distribution may not be known

18
Outlier Discovery Distance-Based Approach
  • Introduced to counter the main limitations
    imposed by statistical methods
  • We need multi-dimensional analysis without
    knowing data distribution.
  • Distance-based outlier A DB(p, D)-outlier is an
    object O in a dataset T such that at least a
    fraction p of the objects in T lies at a distance
    greater than D from O
  • Algorithms for mining distance-based outliers
  • Index-based algorithm
  • Nested-loop algorithm
  • Cell-based algorithm

19
Anomaly Detection Schemes
  • General Steps
  • Build a profile of the normal behavior
  • Profile can be patterns or summary statistics for
    the overall population
  • Use the normal profile to detect anomalies
  • Anomalies are observations whose
    characteristicsdiffer significantly from the
    normal profile
  • Types of anomaly detection schemes
  • Graphical Statistical-based
  • Distance-based
  • Model-based

20
Convex Hull Method
  • Extreme points are assumed to be outliers
  • Use convex hull method to detect extreme values
  • What if the outlier occurs in the middle of the
    data?

21
Nearest-Neighbor Based Approach
  • Approach
  • Compute the distance between every pair of data
    points
  • There are various ways to define outliers
  • Data points for which there are fewer than p
    neighboring points within a distance D
  • The top n data points whose distance to the kth
    nearest neighbor is greatest
  • The top n data points whose average distance to
    the k nearest neighbors is greatest
Write a Comment
User Comments (0)
About PowerShow.com