Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

1
Designing Semantics-Preserving Cluster
Representatives for Scientific Input Conditions
  • Aparna Varde, Elke Rundensteiner,
  • Carolina Ruiz, David Brown,
  • Mohammed Maniruzzaman and Richard Sisson Jr.
  • Worcester Polytechnic Institute
  • Worcester, MA, USA
  • ACM CIKM 2006, Arlington, VA, USA

2
Introduction
  • Clustering often groups data with mixed
    attributes
    • Numeric
    • Categorical
    • Ordinal
  • Examples: PDAs, Web Pages, Scientific Experiments
  • Cluster Representatives: depictions of each
    cluster
  • Randomly selected representatives are not enough for
    • Capturing cluster information
    • Providing ease of interpretation
    • Incorporating different user interests
  • Need for Designing Cluster Representatives

3
Motivating Example
  • Scientific experiments clustered based on results
  • Clustering criteria learned based on input
    conditions
  • Representative of conditions used to characterize
    a cluster
  • Problem with randomly selected representative:
    • Distinct combinations of conditions could lead to
      a given cluster

[Figure: Decision tree learning the clustering criteria (Heat Treating of Materials)]

4
Goals
  • Need to Design Semantics-Preserving Cluster
    Representatives that
    • Capture relevant information in cluster
    • Avoid visual clutter and are easy to interpret
    • Take into account various user interests in
      targeted applications

5
Proposed Approach: DesCond
  • Given: clusters of experiments, conditions
    leading to clusters
  • Define notion of distance for conditions
    incorporating domain semantics
  • Build candidate representatives with
    increasing levels of detail
  • Compare candidates using MDL-based encoding
    capturing user interests
  • Return candidate with lowest
    encoding as best for each cluster

6
Main Tasks in DesCond
  • Defining a notion of distance for the input
    conditions
  • Obtaining suitable candidate representatives for
    each cluster
  • Proposing an encoding to compare candidates and
    find a winner

7
Notion of Distance
  • Example: Heat Treating of Materials
    • Quenchant: Cooling medium
    • Part: The material being treated
    • Probe: Characterizes shape, dimension
    • Oxide: Thickness of oxide on surface
    • Agitation: Extent of agitation of cooling medium
    • Quenchant Temperature: Starting temperature of
      cooling medium
  • Define domain-specific distance metric for
    conditions incorporating
    • Data types of attributes
    • Distance between attribute values
    • Weights of the attributes

8
Data Types of the Attributes
  • Categorical
    • Characters or strings with descriptive
      information
    • E.g., Quenchant Name, Part Material, Probe Type
  • Numerical
    • Integers or real numbers
    • E.g., Quenchant Temperature
  • Ordinal
    • Where order matters
    • E.g., Oxide Layer, Agitation Level

9
Distance Between the Attribute Values
  • Categorical
    • Different: 1
    • Same: 0
  • Numerical
    • Absolute difference between
      • Values, or
      • Mean values of ranges
  • Ordinal
    • Map values to integers
    • E.g., Oxide Layer: none = 0, thin = 1, thick = 2
    • Absolute difference between mapped values
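
The three per-attribute rules on this slide can be sketched in Python as below. This is a minimal illustration, not the authors' code: the function names, the tuple representation of a range, and the OXIDE_SCALE dictionary are assumptions; only the none/thin/thick mapping comes from the slide.

```python
# Ordinal mapping from the slide's Oxide Layer example.
OXIDE_SCALE = {"none": 0, "thin": 1, "thick": 2}


def categorical_distance(a, b):
    """Same value gives 0, different values give 1."""
    return 0 if a == b else 1


def numerical_distance(a, b):
    """Absolute difference between values; a (low, high) range is
    first reduced to its mean (the tuple form is an assumption)."""
    def as_value(x):
        if isinstance(x, tuple):
            return sum(x) / len(x)
        return x
    return abs(as_value(a) - as_value(b))


def ordinal_distance(a, b, scale):
    """Map both values to integers, then take the absolute difference."""
    return abs(scale[a] - scale[b])


print(categorical_distance("oil", "water"))
print(numerical_distance((20, 40), 50))
print(ordinal_distance("none", "thick", OXIDE_SCALE))
```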

10
Weights of the Attributes
  • Attribute has higher weight if it
    • Is at higher level in tree
    • Belongs to a shorter path
    • Has more experiments in its corresponding cluster
  • Decision Tree Weight Heuristic
    • $W_i = \frac{1}{P} \sum_{j=1}^{P} \frac{H_{i,j}}{H_j} G_j$
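
A hedged sketch of the weight heuristic follows, reading the formula as W_i = (1/P) times the sum over paths j = 1..P of (H_{i,j} / H_j) * G_j. The transcript does not define the symbols, so the meanings used here are assumptions inferred from the bullets above: P is the number of paths in the decision tree, H_{i,j} is the height of attribute i on path j (larger when the attribute is nearer the root, 0 when absent), H_j is the height of path j, and G_j reflects the share of experiments in the cluster at the end of path j. The dictionary layout is hypothetical.

```python
def attribute_weight(paths, attribute):
    """Average over all paths of (H_ij / H_j) * G_j.

    Each path is a dict with hypothetical keys:
      "heights": attribute name -> H_ij on this path
      "path_height": H_j
      "g": G_j
    """
    total = 0.0
    for path in paths:
        h_ij = path["heights"].get(attribute, 0)  # 0 if attribute not on path
        total += (h_ij / path["path_height"]) * path["g"]
    return total / len(paths)


# Made-up tree with two paths, for illustration only.
example_paths = [
    {"heights": {"Quenchant": 2, "Part": 1}, "path_height": 2, "g": 0.5},
    {"heights": {"Quenchant": 3}, "path_height": 3, "g": 0.25},
]
print(attribute_weight(example_paths, "Quenchant"))
print(attribute_weight(example_paths, "Part"))
```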

11
Candidate Representatives in Levels of Detail
  • Level 1: Single Conditions Representative (SCR)
    • One set of conditions preserving cluster
      information
  • Level 2: Multiple Conditions Representative (MCR)
    • Summary of information in cluster
  • Level 3: All Conditions Representative (ACR)
    • All information in cluster abstracted suitably

12
Single Conditions Representative
[Figures: Input conditions in Cluster A; SCR for Cluster A]
  • Return set of conditions closest to all others in
    cluster
  • Notion of distance: Domain-specific distance
    metric for conditions
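
Choosing the set of conditions closest to all others amounts to picking the item with the smallest total distance to the rest of the cluster. A minimal sketch, assuming a stand-in distance: toy_distance, its weights, the temperature scaling, and the sample cluster are all hypothetical placeholders for the domain-specific metric.

```python
def single_conditions_representative(conditions, distance):
    """Return the set of conditions with the smallest summed
    distance to every set of conditions in the cluster."""
    return min(
        conditions,
        key=lambda c: sum(distance(c, other) for other in conditions),
    )


# Hypothetical weights and scaling, for illustration only.
WEIGHTS = {"quenchant": 0.5, "temperature": 0.5}


def toy_distance(a, b):
    d_quenchant = 0 if a["quenchant"] == b["quenchant"] else 1  # categorical
    d_temperature = abs(a["temperature"] - b["temperature"]) / 100  # numerical
    return (WEIGHTS["quenchant"] * d_quenchant
            + WEIGHTS["temperature"] * d_temperature)


cluster_a = [
    {"quenchant": "oil", "temperature": 20},
    {"quenchant": "oil", "temperature": 40},
    {"quenchant": "water", "temperature": 60},
]
print(single_conditions_representative(cluster_a, toy_distance))
```

In this made-up cluster the second item has the lowest summed distance, so it is returned.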

13
Multiple Conditions Representative
  • Build sub-clusters of conditions using domain
    knowledge
  • Return nearest sub-cluster representatives
  • Sort them

[Figures: Cluster A; Sub-clusters within Cluster A; MCR for Cluster A]
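
One possible reading of these steps is sketched below. The slide does not say how domain knowledge forms the sub-clusters or what "nearest" is measured against, so the sketch assumes a caller-supplied grouping function and picks, within each sub-cluster, the item with the smallest summed distance to the others. The function name, parameters, and sample data are hypothetical.

```python
from collections import defaultdict


def multiple_conditions_representative(conditions, group_key, distance, sort_key):
    """Group conditions into sub-clusters, pick one representative
    per sub-cluster, and return the representatives sorted."""
    groups = defaultdict(list)
    for c in conditions:
        groups[group_key(c)].append(c)  # group_key stands in for domain knowledge
    representatives = [
        min(group, key=lambda c: sum(distance(c, other) for other in group))
        for group in groups.values()
    ]
    return sorted(representatives, key=sort_key)


cluster = [
    {"quenchant": "oil", "temperature": 20},
    {"quenchant": "oil", "temperature": 40},
    {"quenchant": "oil", "temperature": 30},
    {"quenchant": "water", "temperature": 60},
]
mcr = multiple_conditions_representative(
    cluster,
    group_key=lambda c: c["quenchant"],
    distance=lambda a, b: abs(a["temperature"] - b["temperature"]),
    sort_key=lambda c: c["temperature"],
)
print(mcr)
```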
14
All Conditions Representative
[Figures: Cluster A; ACR for Cluster A]
  • Return all sets of conditions
  • Sort them in ascending order

15
DesCond Encoding to Compare Candidates
  • Analogous to Minimum Description Length (MDL)
    • Theory: representative; Examples: sets of
      conditions in cluster
  • Complexity of representative (ease of
    interpretation)
    • $\text{Complexity} = \log_2(AV)$
    • A: number of attributes, V: number of values for
      each attribute
  • Distance of all items from representative
    (information loss)
    • $\text{Distance} = \log_2\left(\frac{1}{s} \sum_{i=1}^{s} D(R, S_i)\right)$
    • D: domain-specific distance metric for conditions
    • s: total number of items (sets of conditions) in
      cluster
    • S_i: each individual item
    • R: representative set of conditions
  • DesCond Encoding
    • $\text{Effectiveness} = UBC \cdot \text{Complexity} + UBD \cdot \text{Distance}$
    • UBC, UBD: user bias weights for complexity and
      distance
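
The encoding can be sketched as follows, reading it as Effectiveness = UBC * log2(A * V) + UBD * log2 of the mean distance D(R, S_i) over the s items, with the lowest value winning. This is an illustration under assumptions: the function names, the tuple layout of a candidate, and the numbers are made up, and the transcript does not say how a mean distance of zero is handled (math.log2 raises ValueError in that case).

```python
import math


def descond_encoding(num_attributes, num_values, distances, ubc, ubd):
    """UBC * Complexity + UBD * Distance for one candidate.

    distances holds D(R, S_i) for every item S_i in the cluster.
    """
    complexity = math.log2(num_attributes * num_values)
    distance = math.log2(sum(distances) / len(distances))  # fails if mean is 0
    return ubc * complexity + ubd * distance


def best_candidate(candidates, ubc, ubd):
    """Return the name of the candidate with the lowest encoding."""
    return min(
        candidates,
        key=lambda name: descond_encoding(*candidates[name], ubc, ubd),
    )


# Hypothetical candidates: (A, V, distances of items from the representative).
candidates = {
    "SCR": (4, 1, [0.25, 0.5, 0.75]),
    "MCR": (4, 2, [0.125, 0.125, 0.125]),
}
print(best_candidate(candidates, ubc=0.5, ubd=0.5))
print(best_candidate(candidates, ubc=0.9, ubd=0.1))
```

With these made-up numbers the more detailed candidate wins under equal weights, and the simpler one wins once the complexity weight dominates.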

16
Evaluation of DesCond with Domain Expert
Interviews
  • Evaluated with real data in Heat Treating
  • User Bias weights in Encoding reflect interests
    in targeted applications
  • Different data sets and number of clusters
  • For each data set, score calculated as follows:
    • Consider winning candidate for each cluster
      • Based on DesCond Encoding
    • Score = Number of clusters in which candidate is
      winner
  • Example: Dataset of size 25 with 5 clusters
    • If SCR wins for 2 clusters, ACR for 3
    • Score: SCR = 2, ACR = 3
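
The scoring rule is a tally of winners per cluster. A minimal sketch reproducing the slide's 5-cluster example; the order of winners in the list is made up, only the counts come from the slide.

```python
from collections import Counter


def score(winners_per_cluster):
    """Count the clusters in which each candidate type wins."""
    return Counter(winners_per_cluster)


# One winner per cluster: SCR wins 2 clusters, ACR wins 3.
winners = ["SCR", "ACR", "ACR", "SCR", "ACR"]
scores = score(winners)
print(scores["SCR"], scores["ACR"], scores["MCR"])
```

A Counter returns 0 for a candidate type that never wins, so MCR needs no special handling here.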

17
Evaluation Results
  • Details
    • Data Set Size: 400, Number of Clusters: 20
    • Experts provide UBC / UBD values in Encoding
  • Observations
    • Overall winner is MCR
    • As weight for complexity increases, SCR wins
    • Designed representatives perform better than random ones

18
Evaluation with Formal User Surveys
  • DesCond used to design representatives for a
    trademarked estimation tool (ref: CHTE, Center for
    Heat Treating Excellence)
  • Formal user surveys conducted in different
    applications of the system
  • Evaluation Process
    • Compare estimation with real data in test set
    • If they match, estimation is accurate

19
Evaluation Results
[Charts: Parameter Selection Applications; Simulation Tool Applications; Intelligent Tutoring Applications; Decision Support Applications]
  • Different winners in different applications
  • Results of surveys tally with those of
    Encoding-based evaluation
  • Estimation Accuracy: 90 to 94% (better than
    earlier versions of tool)

20
Related Work
  • Image Rating [HH-01]
    • User intervention involved in manual rating
  • Semantic Fish Eye Views [JP-04]
    • Display multiple objects in small space, no
      representatives
  • PDA Displays in Levels of Detail [BGMP-01]
    • Do not evaluate different types of
      representatives

21
Conclusions
  • Contributions of this work
    • Designing cluster representatives for scientific
      input conditions in levels of detail
    • Defining a domain-specific distance metric for
      conditions
    • Proposing an encoding to compare representatives
    • Conducting evaluation using encoding with real
      data from Heat Treating
    • Assessing use of representatives in applications
      of a CHTE trademarked estimation tool
  • Results
    • Designed Representatives better than random
    • Different designed representatives suit different
      applications
    • DesCond enhances accuracy of estimation tool