FINAL PROJECT - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
FINAL PROJECT
  • COURSE NAME: Principles of Data Mining
  • COURSE CODE: COP-5992
  • PROFESSOR: Dr. Tao Li
  • APPROXIMATE DISTANCE CLASSIFICATION
  • BY RAMAKRISHNA VARADARAJAN

2
AN OVERVIEW
  • THE PROBLEM: Curse of Dimensionality.
  • THE SOLUTION.
  • THE METHODS.
  • ADC PROJECTIONS.
  • EXAMPLES.
  • EVALUATING THE PROJECTIONS.
  • IMPLEMENTATION OF THE METHOD.
  • CONCLUSIONS.

3
THE PROBLEM
  • Classification and clustering in high dimensions
    are NOTORIOUSLY DIFFICULT PROBLEMS.
  • Instead of searching for the clustering structure
    of the original, high-dimensional observations,
    it is common practice to employ dimension-reduction
    methods.
  • The question "How to project?" naturally arises.

4
THE SOLUTION
  • In 1997, Cowen and Priebe [4] introduced a class
    of nonlinear projections that is easy to
    construct and has been demonstrated to preserve
    clustering structure in high-dimensional data
    sets that strongly cluster.
  • The motivation behind their work is to reduce
    dimensionality while approximately preserving
    inter-cluster distances.
  • The projections developed by Cowen and Priebe
    approximately preserve inter-class distances.

5
THE METHOD
  • Consequently, the classification and clustering
    techniques based on them are referred to as
    Approximate Distance Classification and
    Clustering methods, or ADC methods for short.
  • ADVANTAGES:
  • No pre-processing needed.
  • No data-dependent adjustments needed.
  • Performs surprisingly well in very high
    dimensions, where conventional methods fail
    for theoretical and computational reasons.

6
ISSUES TO BE ADDRESSED
  • While using ADC METHODS we come across the
    following issues:
  • How many projections do we need to generate to
    get some that are useful?
  • How do we distinguish the "best" (or most
    useful) projections from the rest?

7
ADC PROJECTIONS
  • Given a set of observations in a high-dimensional
    space, we first seek a projection of the data into
    a lower-dimensional space in which approximate
    inter-cluster distances are maintained.
  • Definition: Let S = {X1, X2, ..., Xn} be a
    collection of n vectors (n instances) in d
    dimensions (d attributes), let D be a subset of S
    (instances), and let ||.|| denote the L2 norm.
    The associated ADC map is defined as the function
  • ADC_D: Xi → min_{Z in D} ||Xi − Z||
  • L2 NORM: ||X − Z|| = sqrt( Σ_j (X_j − Z_j)² )
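The map just defined can be written in a few lines of Python (a minimal sketch; the function names `l2_norm` and `adc_projection` are our own):

```python
import math

def l2_norm(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def adc_projection(x, witness_set):
    """ADC_D(x): the minimum L2 distance from x to any witness Z in D.
    Maps a d-dimensional instance to a single real number."""
    return min(l2_norm(x, z) for z in witness_set)
```

Note that each call collapses a d-dimensional vector to one number, which is exactly the d-to-1 projection the definition describes.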

8
WITNESS SETS
  • The set D (the subset of instance set) in the
    above definition will be referred to as the
    witness set that generates its associated
    projection.
  • Clearly, each ADC map is completely determined by
    the witness set used, and each witness set
    determines a projection from d dimensions to one
    dimension.
  • In what follows, we will always choose the
    witness set entirely from one of the classes,
    without loss of generality.

9
EXAMPLE ADC PROJECTION
  • IRIS DATA
  • Number of Instances: 150 (50 in each of three
    classes)
  • Number of Attributes: 4 numeric, predictive
    attributes and a class attribute.
  • Example instances of iris data (first six):
  • 5.1,3.5,1.4,0.2,Iris-setosa
  • 4.9,3.0,1.4,0.2,Iris-setosa
  • 4.7,3.2,1.3,0.2,Iris-setosa
  • 4.6,3.1,1.5,0.2,Iris-setosa
  • 5.0,3.6,1.4,0.2,Iris-setosa
  • 5.4,3.9,1.7,0.4,Iris-setosa
  • In our definition, S is the set of all instances
    and D is the subset of instances called the
    witness set.
  • Let's select a witness set of size 2 at random
    (always within one class). When computing the ADC
    projection we don't take the class attribute into
    consideration for dimensionality reduction.
  • D = {(5.0, 3.6, 1.4, 0.2), (5.4, 3.9, 1.7, 0.4)}

10
EXAMPLE ADC PROJECTION (continued)
  • Now we apply the ADC projection to the first
    instance of the IRIS data: (5.1, 3.5, 1.4, 0.2).
  • Calculate the distance between the first instance
    and (5.0, 3.6, 1.4, 0.2) using the L2 norm
    (distance formula). The result is 0.141.
  • Calculate the distance between the first instance
    and (5.4, 3.9, 1.7, 0.4) using the L2 norm. The
    result is 0.616.
  • Then take the minimum of the two results, here
    0.141.
  • So the first instance is projected to 0.141.
  • Repeat this procedure for all the remaining
    instances to get the overall projection of the
    IRIS data by the selected witness set.
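This walkthrough can be reproduced directly (a sketch; the helper name `adc_projection` is ours, and only the first four listed instances are projected here):

```python
import math

def adc_projection(x, witness_set):
    # ADC_D(x): minimum L2 distance from x to any witness in D
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
               for z in witness_set)

# Witness set chosen on the previous slide (two Iris-setosa instances)
D = [(5.0, 3.6, 1.4, 0.2), (5.4, 3.9, 1.7, 0.4)]

# First four Iris-setosa instances; each 4-dimensional vector
# projects to a single number
for x in [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2),
          (4.7, 3.2, 1.3, 0.2), (4.6, 3.1, 1.5, 0.2)]:
    print(round(adc_projection(x, D), 3))
```

The first value printed is 0.141, matching the hand calculation above.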

11
IDENTIFYING GOOD WITNESS SETS
  • You can vary the size of the witness set and the
    elements in it, but stay within a single class to
    preserve the inter-cluster distances and
    relationships.
  • Once the data is projected using a selected
    witness set, the resulting projected data is
    classified using a conventional method (for
    example, K-nearest neighbors or any other
    classification algorithm).
  • We call the conventional classifier used in
    combination with ADC the ADC sub-classifier.

12
EVALUATING THE PROJECTIONS
  • There are many existing methods to measure the
    quality of a projection.
  • The projection generated by D can be evaluated
    with respect to a particular ADC sub-classifier
    using CROSS-VALIDATION.
  • When presented with a new unlabeled observation
    X, the method projects X to one dimension using
    ADC_D and then labels ADC_D(X) using the selected
    ADC sub-classifier.
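Concretely, classifying a new observation might look like this (a sketch; a 1-nearest-neighbor sub-classifier on the projected values is used purely as an illustration, and the function names are ours):

```python
import math

def adc_projection(x, witness_set):
    # ADC_D(x): minimum L2 distance from x to any witness in D
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
               for z in witness_set)

def classify(x, D, train_proj, train_labels):
    """Project the unlabeled observation x to one dimension with ADC_D,
    then label ADC_D(x) by the nearest projected training value
    (a 1-NN sub-classifier; any conventional classifier could be used)."""
    v = adc_projection(x, D)
    i = min(range(len(train_proj)), key=lambda i: abs(train_proj[i] - v))
    return train_labels[i]
```

Here `train_proj` holds the ADC_D projections of the training instances, computed once per witness set.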

13
PROCEDURE SO FAR..
  • First, we sample w witness sets from the set of
    all size-s subsets of the training data in a
    single class. For example, in the IRIS data there
    are 3 classes (50 instances in each), so if we
    select the size s to be 3, there are C(50, 3) =
    19,600 possible witness sets, from which we can
    sample any number w for projection.
  • Evaluate the witness sets using cross-validation.
  • Select the r best-scoring witness sets, where r
    is called the filtering parameter.
  • THE PARAMETERS w, s, AND r CAN BE VARIED.
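The three steps above can be sketched end to end (a minimal illustration; the leave-one-out 1-NN scorer stands in for whichever cross-validated ADC sub-classifier is chosen, and all function names are ours):

```python
import math
import random

def adc_projection(x, witness_set):
    # ADC_D(x): minimum L2 distance from x to any witness in D
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
               for z in witness_set)

def loo_1nn_accuracy(values, labels):
    # Leave-one-out 1-NN accuracy on the 1-D projected data
    # (a stand-in for cross-validating the chosen sub-classifier)
    correct = 0
    for i, v in enumerate(values):
        j = min((k for k in range(len(values)) if k != i),
                key=lambda k: abs(values[k] - v))
        correct += labels[j] == labels[i]
    return correct / len(values)

def best_witness_sets(train, labels, cls, s, w, r, seed=0):
    """Sample w size-s witness sets from class `cls`, score each by the
    sub-classifier's accuracy on the projected data, keep the r best."""
    rng = random.Random(seed)
    pool = [x for x, y in zip(train, labels) if y == cls]
    scored = []
    for _ in range(w):
        D = rng.sample(pool, s)                       # one candidate witness set
        proj = [adc_projection(x, D) for x in train]  # project all of S to 1-D
        scored.append((loo_1nn_accuracy(proj, labels), D))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:r]  # the r best-scoring (score, witness set) pairs
```

On the IRIS data this would be called with, for example, s=3, some sample count w, and a filtering parameter r.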

14
IMPLEMENTATION