FINAL PROJECT - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
FINAL PROJECT
  • COURSE NAME: Principles of Data Mining
  • COURSE CODE: COP-5992
  • PROFESSOR: Dr. Tao Li
  • APPROXIMATE DISTANCE CLASSIFICATION
  • BY RAMAKRISHNA VARADARAJAN

2
AN OVERVIEW
  • THE PROBLEM: Curse of Dimensionality.
  • THE SOLUTION.
  • THE METHODS.
  • ADC PROJECTIONS.
  • EXAMPLES.
  • EVALUATING THE PROJECTIONS.
  • IMPLEMENTATION OF THE METHOD.
  • CONCLUSIONS.

3
THE PROBLEM
  • Classification and clustering in high dimensions
    are NOTORIOUSLY DIFFICULT PROBLEMS.
  • Instead of searching for the clustering structure
    of the original, high-dimensional observations,
    it is common practice to employ dimension-reduction
    methods.
  • The question "How to project?" naturally arises.

4
THE SOLUTION
  • In 1997, Cowen and Priebe [4] introduced a class
    of nonlinear projections that is easy to
    construct and has been demonstrated to preserve
    clustering structure in high-dimensional data
    sets that strongly cluster.
  • The motivation behind their work is to reduce
    dimensionality while approximately preserving
    inter-cluster distances.
  • The projections developed by Cowen and Priebe
    approximately preserve inter-class distances.

5
THE METHOD
  • Consequently, the classification and clustering
    techniques based on them are referred to as
    Approximate Distance Classification and
    Clustering methods, or ADC methods for short.
  • ADVANTAGES:
  • No pre-processing needed.
  • No data-dependent adjustments needed.
  • Performs surprisingly well in very high
    dimensions, where conventional methods fail
    for theoretical and computational reasons.

6
ISSUES TO BE ADDRESSED
  • While using ADC METHODS we come across the
    following issues:
  • How many projections do we need to generate to
    get some that are useful?
  • How do we distinguish the "best" (or most
    useful) projections from the rest?

7
ADC PROJECTIONS
  • Given a set of observations in a high-dimensional
    space, we first seek a projection of the data into
    a lower-dimensional space in which approximate
    inter-cluster distances are maintained.
  • Definition: Let S = {X1, X2, ..., Xn} be a
    collection of n vectors (n instances) in d
    dimensions (d attributes), let D be a subset of S
    (instances), and let ||.|| denote the L2 norm.
    The associated ADC map is defined as the function
  • ADC_D: Xi → min_{Z in D} ||Xi − Z||
  • L2 NORM: ||X − Z|| = sqrt( Σ_j (X_j − Z_j)² )
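The map just defined can be written in a few lines of Python (a minimal sketch; the function names `l2_norm` and `adc_projection` are our own):

```python
import math

def l2_norm(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def adc_projection(x, witness_set):
    """ADC_D(x): the minimum L2 distance from x to any witness Z in D.
    Maps a d-dimensional instance to a single real number."""
    return min(l2_norm(x, z) for z in witness_set)
```

Note that each call collapses a d-dimensional vector to one number, which is exactly the d-to-1 projection the definition describes.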

8
WITNESS SETS
  • The set D (the subset of instance set) in the
    above definition will be referred to as the
    witness set that generates its associated
    projection.
  • Clearly, each ADC map is completely determined by
    the witness set used, and each witness set
    determines a projection from d dimensions to one
    dimension.
  • In what follows, we will always choose the
    witness set entirely from one of the classes,
    without loss of generality.

9
EXAMPLE ADC PROJECTION
  • IRIS DATA
  • Number of Instances: 150 (50 in each of three
    classes)
  • Number of Attributes: 4 numeric, predictive
    attributes and a class attribute.
  • Example instances of iris data (first six):
  • 5.1,3.5,1.4,0.2,Iris-setosa
  • 4.9,3.0,1.4,0.2,Iris-setosa
  • 4.7,3.2,1.3,0.2,Iris-setosa
  • 4.6,3.1,1.5,0.2,Iris-setosa
  • 5.0,3.6,1.4,0.2,Iris-setosa
  • 5.4,3.9,1.7,0.4,Iris-setosa
  • In our definition, S is the set of all instances
    and D is the subset of instances called the
    witness set.
  • Let's select a witness set of size 2 at random
    (always within one class). When computing the ADC
    projection we don't take the class attribute into
    consideration for dimensionality reduction.
  • D = {(5.0, 3.6, 1.4, 0.2), (5.4, 3.9, 1.7, 0.4)}

10
EXAMPLE ADC PROJECTION (continued)
  • Now we apply the ADC projection to the first
    instance of the IRIS data: (5.1, 3.5, 1.4, 0.2).
  • Calculate the distance between the first instance
    and (5.0, 3.6, 1.4, 0.2) using the L2 norm
    (distance formula). The result is 0.141.
  • Calculate the distance between the first instance
    and (5.4, 3.9, 1.7, 0.4) using the L2 norm. The
    result is 0.616.
  • Then take the minimum of the two results, here
    0.141.
  • So the first instance is projected to 0.141.
  • Repeat this procedure for all the remaining
    instances to get the overall projection of the
    IRIS data by the selected witness set.
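This walkthrough can be reproduced directly (a sketch; the helper name `adc_projection` is ours, and only the first four listed instances are projected here):

```python
import math

def adc_projection(x, witness_set):
    # ADC_D(x): minimum L2 distance from x to any witness in D
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
               for z in witness_set)

# Witness set chosen on the previous slide (two Iris-setosa instances)
D = [(5.0, 3.6, 1.4, 0.2), (5.4, 3.9, 1.7, 0.4)]

# First four Iris-setosa instances; each 4-dimensional vector
# projects to a single number
for x in [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2),
          (4.7, 3.2, 1.3, 0.2), (4.6, 3.1, 1.5, 0.2)]:
    print(round(adc_projection(x, D), 3))
```

The first value printed is 0.141, matching the hand calculation above.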

11
IDENTIFYING GOOD WITNESS SETS
  • You can vary the size of the witness set and the
    elements in it, but stay within a single class to
    preserve the inter-cluster distances and
    relationships.
  • Once the data is projected using a selected
    witness set, the resulting projected data is
    classified using a conventional method (for
    example, K-nearest neighbors or any other
    classification algorithm).
  • We call the conventional classifier used in
    combination with ADC the ADC sub-classifier.

12
EVALUATING THE PROJECTIONS
  • There are many existing methods to measure the
    quality of a projection.
  • The projection generated by D can be evaluated
    with respect to a particular ADC sub-classifier
    using CROSS-VALIDATION.
  • When presented with a new unlabeled observation
    X, the method projects X to one dimension using
    ADC_D and then labels ADC_D(X) using the selected
    ADC sub-classifier.
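Concretely, classifying a new observation might look like this (a sketch; a 1-nearest-neighbor sub-classifier on the projected values is used purely as an illustration, and the function names are ours):

```python
import math

def adc_projection(x, witness_set):
    # ADC_D(x): minimum L2 distance from x to any witness in D
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
               for z in witness_set)

def classify(x, D, train_proj, train_labels):
    """Project the unlabeled observation x to one dimension with ADC_D,
    then label ADC_D(x) by the nearest projected training value
    (a 1-NN sub-classifier; any conventional classifier could be used)."""
    v = adc_projection(x, D)
    i = min(range(len(train_proj)), key=lambda i: abs(train_proj[i] - v))
    return train_labels[i]
```

Here `train_proj` holds the ADC_D projections of the training instances, computed once per witness set.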

13
PROCEDURE SO FAR..
  • First, we sample w witness sets from the set of
    all size-s subsets of the training data in a
    single class. For example, in the IRIS data there
    are 3 classes (50 instances in each), so if we
    select the size s to be 3, there are C(50, 3) =
    19,600 possible witness sets, from which we can
    sample any number w for projection.
  • Evaluate the witness sets using cross-validation.
  • Select the r best-scoring witness sets, where r
    is called the filtering parameter.
  • THE PARAMETERS w, s, AND r CAN BE VARIED.
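The three steps above can be sketched end to end (a minimal illustration; the leave-one-out 1-NN scorer stands in for whichever cross-validated ADC sub-classifier is chosen, and all function names are ours):

```python
import math
import random

def adc_projection(x, witness_set):
    # ADC_D(x): minimum L2 distance from x to any witness in D
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
               for z in witness_set)

def loo_1nn_accuracy(values, labels):
    # Leave-one-out 1-NN accuracy on the 1-D projected data
    # (a stand-in for cross-validating the chosen sub-classifier)
    correct = 0
    for i, v in enumerate(values):
        j = min((k for k in range(len(values)) if k != i),
                key=lambda k: abs(values[k] - v))
        correct += labels[j] == labels[i]
    return correct / len(values)

def best_witness_sets(train, labels, cls, s, w, r, seed=0):
    """Sample w size-s witness sets from class `cls`, score each by the
    sub-classifier's accuracy on the projected data, keep the r best."""
    rng = random.Random(seed)
    pool = [x for x, y in zip(train, labels) if y == cls]
    scored = []
    for _ in range(w):
        D = rng.sample(pool, s)                       # one candidate witness set
        proj = [adc_projection(x, D) for x in train]  # project all of S to 1-D
        scored.append((loo_1nn_accuracy(proj, labels), D))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:r]  # the r best-scoring (score, witness set) pairs
```

On the IRIS data this would be called with, for example, s=3, some sample count w, and a filtering parameter r.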

14
IMPLEMENTATION