Fast Kernel-Density-Based Classification and Clustering Using P-Trees


1
Fast Kernel-Density-Based Classification and
Clustering Using P-Trees
  • Anne Denton
  • Major Advisor: William Perrizo

2
Outline
  • Introduction
  • P-Trees
  • Concepts
  • Implementation
  • Kernel Methods
  • Paper 1: Rule-Based Classification
  • Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
  • Paper 3: Hierarchical Clustering
  • Outlook

3
Introduction
  • Data Mining
  • Extracting information from data
  • Takes storage issues into account
  • P-Tree Approach
  • Bit-column-based storage (sketch below)
  • Compression
  • Hardware optimization
  • Simple index construction
  • Flexibility in storage organization
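
A minimal sketch of the bit-column idea, assuming 8-bit unsigned attribute values (plain Java arrays; the names are illustrative, not the actual P-tree code):

public class BitColumns {
    // Split one 8-bit attribute into 8 bit columns: cols[j][i] is bit j of record i.
    public static boolean[][] decompose(int[] values, int bits) {
        boolean[][] cols = new boolean[bits][values.length];
        for (int i = 0; i < values.length; i++)
            for (int j = 0; j < bits; j++)
                cols[j][i] = ((values[i] >> j) & 1) == 1;
        return cols;
    }

    public static void main(String[] args) {
        boolean[][] cols = decompose(new int[] {5, 3, 12, 7}, 8);
        int ones = 0;
        for (boolean bit : cols[2]) if (bit) ones++;  // count records with bit 2 set
        System.out.println("records with bit 2 set: " + ones);  // prints 3
    }
}

Counts over such bit columns reduce to population counts, which is what the compressed P-tree representation accelerates.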

4
P-Tree Concepts
  • Ordering (details)
  • New: generalized Peano order sorting (sketch below)
  • Compression
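
Peano (Z-) order interleaves the bits of the coordinates, so records that are close in attribute space end up close in the sort order, which lengthens the pure runs that P-tree compression exploits. A minimal sketch for two 16-bit coordinates (the generalized sorting referenced above is not shown):

public final class PeanoKey {
    // Interleave the bits of two 16-bit coordinates, most significant first.
    static long interleave(int x, int y) {
        long key = 0;
        for (int b = 15; b >= 0; b--) {
            key = (key << 1) | ((x >> b) & 1);
            key = (key << 1) | ((y >> b) & 1);
        }
        return key;
    }
}

Sorting records by such keys places neighbors in the same quadrants of the quadrant tree.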

5
Impact of Peano Order Sorting
6
P-Tree Implementation
  • Implementation in Java
  • Ported to C/C++ (Amal Perera, Masum Serazi)
  • Fastest compressing P-tree implementation so far
  • Array indices as pointers (details; sketch below)
  • Grandchild purity
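
One way to read "array indices as pointers": store the quadrant tree in a flat array and compute child positions instead of following object references. A sketch under that assumption (not the actual C/C++ layout; the grandchild-purity flags, which let traversal skip a level, are omitted):

public class ArrayQuadTree {
    final int[] oneCount;  // oneCount[i] = number of 1-bits below node i
    final int[] size;      // size[i] = number of leaf bits below node i

    ArrayQuadTree(int nodes) {
        oneCount = new int[nodes];
        size = new int[nodes];
    }

    static int child(int i, int q) { return 4 * i + 1 + q; }  // q in 0..3
    static int parent(int i) { return (i - 1) / 4; }

    boolean pure0(int i) { return oneCount[i] == 0; }        // subtree is all 0s
    boolean pure1(int i) { return oneCount[i] == size[i]; }  // subtree is all 1s
}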

7
Kernel-Density-Based Classification
  • Probability of an attribute vector x, conditional on the class label value c_i, estimated from N training points x_t:
    P(x \mid c_i) = \frac{\sum_{t=1}^{N} I(c_t = c_i) \, K(x, x_t)}{\sum_{t=1}^{N} I(c_t = c_i)}
  • I(A) is 1 if A is true, 0 otherwise
  • The kernel function K(x, x_t) can be, e.g., a Gaussian function or a step function (sketch below)
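
A minimal sketch of the estimate above with a Gaussian kernel, using plain arrays rather than P-tree counts (the bandwidth h is illustrative):

public class KernelClassifier {
    // Gaussian kernel weight between a query point x and a training point xt.
    static double gaussian(double[] x, double[] xt, double h) {
        double d2 = 0;
        for (int k = 0; k < x.length; k++) {
            double d = x[k] - xt[k];
            d2 += d * d;
        }
        return Math.exp(-d2 / (2 * h * h));
    }

    // Average kernel weight over the training points carrying class label ci.
    static double density(double[] x, double[][] train, int[] label, int ci, double h) {
        double sum = 0;
        int n = 0;
        for (int t = 0; t < train.length; t++)
            if (label[t] == ci) {
                sum += gaussian(x, train[t], h);
                n++;
            }
        return n == 0 ? 0 : sum / n;  // the class with the largest value wins
    }
}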

8
Higher Order Basic bit (HOBbit) Distance
  • P-trees make count evaluation efficient for the
    following intervals (sketch below)
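
The intervals themselves appeared as a figure; based on the published HOBbit definition, the distance is the number of right shifts needed to make two values equal, so each value x lies in nested intervals of width 2^k that share its high-order bits, and interval membership reduces to tests on bit columns. A sketch (assumes non-negative integer values):

public class Hobbit {
    // HOBbit distance: number of right shifts until the two values agree
    // (0 if they are equal). Assumes non-negative values.
    static int hobbitDistance(int a, int b) {
        int d = 0;
        while ((a >> d) != (b >> d)) d++;
        return d;
    }

    // HOBbit interval of x at level k: the 2^k values sharing x's high bits.
    static int intervalLow(int x, int k) { return (x >> k) << k; }
    static int intervalHigh(int x, int k) { return intervalLow(x, k) + (1 << k) - 1; }
}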

9
Paper 1: Rule-Based Classification
  • Goal: high accuracy on large data sets, including
    standard ones (UCI ML Repository)
  • Neighborhood evaluated through
  • Equality of categorical attributes
  • HOBbit interval for continuous attributes
  • Curse of dimensionality
  • Neighborhood volume is empty with high likelihood
  • Information gain to select attributes
  • Attributes considered binary, based on the test
    sample (lazy decision trees, Friedman '96 [4])
  • Continuous data: interval around the test sample
  • Exact information gain (details; sketch below)
  • Pursuing multiple paths
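
Information gain for such a binary (in-interval vs. out-of-interval) attribute needs only four counts, which P-tree ANDs deliver without a scan. A sketch of the computation for a two-class problem (this is the standard formula, not the "exact information gain" derivation on the hidden details slide):

public class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Binary entropy; h(0) = h(1) = 0 by convention.
    static double h(double p) {
        if (p <= 0 || p >= 1) return 0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // n[b][c] = number of training points in branch b (0/1) with class c (0/1).
    static double gain(long[][] n) {
        double n0 = n[0][0] + n[0][1], n1 = n[1][0] + n[1][1], total = n0 + n1;
        double before = h((n[0][1] + n[1][1]) / total);           // parent entropy
        double after = (n0 / total) * h(n0 == 0 ? 0 : n[0][1] / n0)
                     + (n1 / total) * h(n1 == 0 ? 0 : n[1][1] / n1);
        return before - after;
    }
}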

10
Results: Accuracy

Data set    C4.5    20 paths
adult       15.54   14.93
kr-vs-kp    0.8     0.84
mushroom    0       0

  • Comparable to C4.5 after much less development time
  • 5 data sets from the UCI Machine Learning Repository (details)
  • 2 additional data sets: Crop, Gene-function
  • Improvement through multiple paths (20)

11
Results: Speed
  • Used on largest UCI data sets
  • Scaling of execution time as a function of
    training set size

12
Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
  • Goal: handling many attributes
  • Naïve Bayes: P(x \mid c_i) = \prod_k P(x^{(k)} \mid c_i),
    where x^{(k)} is the value of the kth attribute
  • Semi-naïve Bayes
  • Correlated attributes are joined (sketch below)
  • Has been done for categorical data
    (Kononenko '91 [5]; Pazzani '96 [6])
  • Previously: continuous data was discretized
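
For categorical data, joining two correlated attributes simply means treating their value pair as one compound attribute before applying naïve Bayes. A minimal sketch (the pair encoding is illustrative):

public class AttributeJoin {
    // Replace correlated attributes a and b by one compound attribute whose
    // value encodes the pair (a, b); bCard is the number of values of b.
    static int[] join(int[] a, int[] b, int bCard) {
        int[] joined = new int[a.length];
        for (int i = 0; i < a.length; i++)
            joined[i] = a[i] * bCard + b[i];
        return joined;
    }
}

Naïve Bayes then estimates P(joined value | class) directly, capturing the correlation that the independence assumption would otherwise ignore.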

13
Kernel-Based Naïve Bayes
  • Alternatives for continuous data
  • Discretization
  • Distribution function: Gaussian with mean and
    standard deviation estimated from the data
  • No alternative for the semi-naïve approach
  • Kernel density estimate (Hastie [7])

14
Correlations
  • Correlation between attributes a and b
    (assumed-form sketch below)
  • N: number of training points t
  • Kernel function for continuous data
  • d_EH: exponential HOBbit distance
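
The correlation formula itself was shown as an image; one plausible form, consistent with the kernel framework above, compares the mean joint kernel weight with the product of the marginal means. This exact form is an assumption, not the thesis formula:

public class KernelCorrelation {
    interface Kernel { double k(double u, double v); }

    // Ratio of the mean joint kernel weight to the product of the marginal
    // means; values well above 1 suggest correlated attributes (assumed form).
    static double corr(double[] a, double[] b, Kernel ka, Kernel kb) {
        int n = a.length;
        double joint = 0, ma = 0, mb = 0;
        for (int s = 0; s < n; s++)
            for (int t = 0; t < n; t++) {
                double wa = ka.k(a[s], a[t]), wb = kb.k(b[s], b[t]);
                joint += wa * wb;
                ma += wa;
                mb += wb;
            }
        return joint * n * (double) n / (ma * mb);
    }
}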

15
Results
  • P-tree Naïve Bayes
  • Difference only for continuous data
  • Semi-Naïve Bayes
  • 3 parameter combinations
  • Blue: t = 1, 3 iterations
  • Red: t = 0.3, incl. anti-correlations
  • White: t = 0.05 (t: threshold)

16
Paper 3: Hierarchical Clustering [10]
  • Goals
  • Understand the relationships between standard
    algorithms
  • Combine the best aspects of three major ones
  • Partition-based
  • Relationship to k-medoids [8] demonstrated
  • Same cluster boundary definition
  • Density-based (kernel-based, DENCLUE [9]; sketch below)
  • Similar cluster center definition
  • Hierarchical
  • Follows naturally from the above definitions
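
To make the shared cluster-center definition concrete: density-based methods in the DENCLUE family assign each point to the kernel-density maximum reached by hill climbing. A one-dimensional sketch (step size, bandwidth, and iteration bound are illustrative):

public class DensityPeak {
    // Kernel density estimate at x from 1-D training points.
    static double density(double x, double[] pts, double h) {
        double s = 0;
        for (double p : pts) s += Math.exp(-(x - p) * (x - p) / (2 * h * h));
        return s;
    }

    // Hill-climb from x to a local density maximum; points that reach the
    // same maximum belong to the same cluster.
    static double climb(double x, double[] pts, double h, double step) {
        for (int i = 0; i < 1000; i++) {
            double here = density(x, pts, h);
            if (density(x + step, pts, h) > here) x += step;
            else if (density(x - step, pts, h) > here) x -= step;
            else break;
        }
        return x;
    }
}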

17
Results: Speed Comparison with K-Means
18
Results: Clustering Effectiveness
(Figure: cluster assignments of K-means vs. our algorithm)
19
Summary
  • P-tree representation for non-spatial data
  • Fast implementation
  • Paper 1: Rule-Based Algorithm
  • Test-sample-centered intervals, multiple paths
  • Competitive on standard (UCI) data
  • Paper 2: Kernel-Based Semi-Naïve Bayes
  • New algorithm to handle large numbers of attributes
  • Attribute joining shown to be beneficial
  • Paper 3: Hierarchical Clustering [10]
  • Competitive for speed and effectiveness
  • Hierarchical structure

20
Outlook
  • Software engineering aspects
  • Column-oriented design
  • Relationship with P-tree API
  • Non-standard data
  • Data with graph structure
  • Hierarchical data, concept slices [11]
  • Visualization
  • Visualization of data on a graph

21
Software Engineering
  • Business problems: row-based
  • Match between database tables and objects
  • Scientific / engineering problems: column-based
  • Collective properties are of interest (sketch below)
  • Standard OO is unsuitable; instead:
  • Fortran
  • Array-based languages (ZPL)
  • Solution?
  • Design pattern?
  • Library?
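
A small illustration of the tension, with hypothetical types (not a proposed design):

class RowBased {
    // Natural fit for business objects: one object per record.
    static class Reading { double temp; double pressure; }
    Reading[] rows;
}

class ColumnBased {
    // Natural fit for collective properties: one array per attribute.
    double[] temp;
    double[] pressure;

    double meanTemp() {  // whole-column operations are direct
        double s = 0;
        for (double t : temp) s += t;
        return s / temp.length;
    }
}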

22
P-tree API
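
The API itself was shown as a diagram; the sketch below lists the operations such an interface typically exposes. The method names are assumptions, not the thesis API:

public interface PTree {
    PTree and(PTree other);  // bitwise AND of two compressed bit columns
    PTree or(PTree other);   // bitwise OR
    PTree not();             // complement
    long rootCount();        // number of 1-bits, read off the node counts
}

Equality and interval predicates compose from bitwise operations on the bit columns, and rootCount then supplies the counts used by the classification and clustering algorithms above without scanning the data.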
23
Non-standard Data
  • Types of data
  • Biological (KDD Cup '02: our team received an
    honorable mention!)
  • Sensor Networks
  • Types of problems
  • Small probability of minority class label
  • A-ROC Evaluation
  • Multi-valued attributes
  • Bit-vector representation ideal for P-trees
  • Graphs
  • Rich supply of new problems / techniques (work
    with Chris Besemann)
  • Hierarchical categorical attributes [11]

24
Visualization
  • Idea
  • Use graph visualization tool
  • E.g. http://www.touchgraph.com/
  • Visualize node data through glyphs
  • Visualize edge data