Title: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions
1Designing Semantics-Preserving Cluster
Representatives for Scientific Input Conditions
- Aparna Varde, Elke Rundensteiner,
- Carolina Ruiz, David Brown,
- Mohammed Maniruzzaman and Richard Sisson Jr.
- Worcester Polytechnic Institute
- Worcester, MA, USA
- ACM CIKM 2006, Arlington, VA, USA
2Introduction
- Clustering often groups data with mixed attributes
  - Numeric
  - Categorical
  - Ordinal
- Examples: PDAs, Web Pages, Scientific Experiments
- Cluster Representatives: depictions of each cluster
- Randomly selected representatives are not enough for
  - Capturing cluster information
  - Providing ease of interpretation
  - Incorporating different user interests
- Need for Designing Cluster Representatives
3Motivating Example
- Scientific experiments clustered based on results
- Clustering criteria learned based on input conditions
- Representative of conditions used to characterize a cluster
- Problem with randomly selected representative
  - Distinct combinations of conditions could lead to a given cluster
[Figure: Decision tree learning the clustering criteria (Heat Treating of Materials)]
4Goals
- Need to design Semantics-Preserving Cluster Representatives that
  - Capture relevant information in cluster
  - Avoid visual clutter and are easy to interpret
  - Take into account various user interests in targeted applications
5Proposed Approach: DesCond
- Given: Clusters of experiments, conditions leading to clusters
- Define notion of distance for conditions incorporating domain semantics
- Build candidate representatives with increasing levels of detail
- Compare candidates using MDL-based encoding capturing user interests
- Return candidate with lowest encoding as best for each cluster
6Main Tasks in DesCond
- Defining a notion of distance for the input conditions
- Obtaining suitable candidate representatives for each cluster
- Proposing an encoding to compare candidates and find a winner
7Notion of Distance
- Example: Heat Treating of Materials
  - Quenchant: Cooling medium
  - Part: The material being treated
  - Probe: Characterizes shape, dimension
  - Oxide: Thickness of oxide on surface
  - Agitation: Extent of agitation of cooling medium
  - Quenchant Temperature: Starting temperature of cooling medium
- Define domain-specific distance metric for conditions incorporating
  - Data types of attributes
  - Distance between attribute values
  - Weights of the attributes
8Data Types of the Attributes
- Categorical
  - Characters or strings with descriptive information
  - E.g., Quenchant Name, Part Material, Probe Type
- Numerical
  - Integers or real numbers
  - E.g., Quenchant Temperature
- Ordinal
  - Where order matters
  - E.g., Oxide Layer, Agitation Level
9Distance Between the Attribute Values
- Categorical
  - Different: 1
  - Same: 0
- Numerical
  - Absolute difference between
    - Values, or
    - Mean values of ranges
- Ordinal
  - Map values to integers
  - E.g., Oxide Layer: none = 0, thin = 1, thick = 2
  - Absolute difference between mapped values
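The three per-type rules on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `(low, high)` tuple encoding for ranges, and the `OXIDE_LEVELS` dictionary are introduced here for illustration; only the none/thin/thick mapping comes from the slide.

```python
# Hypothetical helpers illustrating the per-type attribute distances.

# Ordinal mapping taken from the slide's Oxide Layer example.
OXIDE_LEVELS = {"none": 0, "thin": 1, "thick": 2}


def categorical_distance(a, b):
    """Return 0 if the two categorical values are the same, 1 otherwise."""
    return 0 if a == b else 1


def _as_number(value):
    """Assumed encoding: a (low, high) tuple is a range and is replaced by its mean."""
    if isinstance(value, tuple):
        low, high = value
        return (low + high) / 2
    return value


def numerical_distance(a, b):
    """Absolute difference between values, or between the mean values of ranges."""
    return abs(_as_number(a) - _as_number(b))


def ordinal_distance(a, b, levels):
    """Absolute difference between the integer codes of two ordinal values."""
    return abs(levels[a] - levels[b])
```

For example, `ordinal_distance("none", "thick", OXIDE_LEVELS)` evaluates to 2, and `numerical_distance((20, 40), 70)` compares the range mean 30 with 70.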
10Weights of the Attributes
- Attribute has higher weight if it
  - Is at higher level in tree
  - Belongs to a shorter path
  - Has more experiments in its corresponding cluster
- Decision Tree Weight Heuristic
  - $W_i = \frac{1}{P} \sum_{j=1}^{P} \frac{H_{i,j}}{H_j} \, G_j$
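The weight heuristic can be sketched as a short function. The slide does not define its symbols, so the reading used here is an assumption drawn from the three bullets: P is the number of decision-tree paths summed over, H_{i,j} is the height of attribute i in path j, H_j is the height of path j, and G_j is the number of experiments in the cluster that path j leads to. The function name and the tuple layout are hypothetical.

```python
def attribute_weight(paths):
    """Compute W_i = (1/P) * sum over j of (H_ij / H_j) * G_j for one attribute.

    `paths` is a list of (h_ij, h_j, g_j) tuples, one per path j.
    The meaning assigned to each element is an assumption (see lead-in).
    """
    if not paths:
        raise ValueError("at least one path is required")
    total = sum((h_ij / h_j) * g_j for h_ij, h_j, g_j in paths)
    return total / len(paths)
```

With two paths `[(3, 3, 10), (2, 4, 6)]`, the sum is 1.0 * 10 + 0.5 * 6 = 13, so the weight is 6.5.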
11Candidate Representatives in Levels of Detail
- Level 1: Single Conditions Representative (SCR)
  - One set of conditions preserving cluster information
- Level 2: Multiple Conditions Representative (MCR)
  - Summary of information in cluster
- Level 3: All Conditions Representative (ACR)
  - All information in cluster abstracted suitably
12Single Conditions Representative
[Figure: Input conditions in Cluster A; SCR for Cluster A]
- Return set of conditions closest to all others in cluster
- Notion of distance: Domain-specific distance metric for conditions
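One way to read "closest to all others" is as the medoid of the cluster, that is, the set of conditions with the smallest total distance to the rest. The sketch below takes that reading as an assumption; the function names, the dictionary layout, and the mismatch-count distance are hypothetical stand-ins for the domain-specific metric.

```python
def single_conditions_representative(conditions, distance):
    """Return the item with the smallest total distance to all items in the cluster."""
    if not conditions:
        raise ValueError("cluster must contain at least one set of conditions")
    return min(
        conditions,
        key=lambda candidate: sum(distance(candidate, other) for other in conditions),
    )


def mismatch_count(a, b):
    """Toy distance (assumption): number of attributes whose values differ."""
    return sum(1 for key in a if a[key] != b[key])


# Illustrative data only; the attribute values are not taken from the slides.
cluster_a = [
    {"quenchant": "Water", "agitation": "high"},
    {"quenchant": "Water", "agitation": "low"},
    {"quenchant": "Oil", "agitation": "low"},
]
scr = single_conditions_representative(cluster_a, mismatch_count)
```

Here the second item differs from each of the others in only one attribute, so it has the lowest total distance and is returned as the SCR.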
13Multiple Conditions Representative
[Figure: Cluster A; Sub-clusters within Cluster A; MCR for Cluster A]
- Build sub-clusters of conditions using domain knowledge
- Return nearest sub-cluster representatives
- Sort them
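The MCR steps can be sketched as below. The slide does not say how domain knowledge forms the sub-clusters or what the sort key is, so this sketch assumes that sub-clusters come from grouping on a caller-supplied key (the hypothetical `group_key`), that each sub-cluster is represented by its medoid, and that the representatives are sorted by that key.

```python
from collections import defaultdict


def medoid(items, distance):
    """Item with the smallest total distance to the other items."""
    return min(items, key=lambda c: sum(distance(c, o) for o in items))


def multiple_conditions_representative(conditions, group_key, distance):
    """Group conditions into sub-clusters, pick one medoid per group, sort by group key."""
    groups = defaultdict(list)
    for item in conditions:
        groups[group_key(item)].append(item)
    return [medoid(groups[key], distance) for key in sorted(groups)]


# Illustrative data only; values are not taken from the slides.
cluster_a = [
    {"quenchant": "Water", "temperature": 20},
    {"quenchant": "Water", "temperature": 30},
    {"quenchant": "Water", "temperature": 70},
    {"quenchant": "Oil", "temperature": 50},
]
mcr = multiple_conditions_representative(
    cluster_a,
    group_key=lambda c: c["quenchant"],
    distance=lambda a, b: abs(a["temperature"] - b["temperature"]),
)
```

In this toy cluster the result holds one representative per quenchant: the single Oil item, followed by the Water item at temperature 30, which is nearest to the other Water items.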
14All Conditions Representative
[Figure: Cluster A; ACR for Cluster A]
- Return all sets of conditions
- Sort them in ascending order
15DesCond Encoding to Compare Candidates
- Analogous to Minimum Description Length (MDL)
  - Theory: representative; Examples: sets of conditions in cluster
- Complexity of representative (ease of interpretation)
  - Complexity = $\log_2(AV)$
  - $A$: number of attributes, $V$: number of values for each attribute
- Distance of all items from representative (information loss)
  - Distance = $\log_2\left(\frac{1}{s} \sum_{i=1}^{s} D(R, S_i)\right)$
  - $D$: domain-specific distance metric for conditions
  - $s$: total number of items (sets of conditions) in cluster
  - $S_i$: each individual item
  - $R$: representative set of conditions
- DesCond Encoding
  - Effectiveness = UBC * Complexity + UBD * Distance
  - UBC, UBD: User bias weights for complexity and distance
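The encoding on this slide can be sketched in a few lines of Python. The sketch assumes the encoding is the weighted sum UBC * Complexity + UBD * Distance, with Complexity = log2(A * V) and Distance = log2 of the mean distance from the representative to the items; the function and parameter names are hypothetical. It takes the per-item distances D(R, S_i) as a precomputed list, so it does not commit to how D is evaluated for a representative with several sets of conditions.

```python
import math


def descond_encoding(num_attributes, num_values, distances, ubc, ubd):
    """Return UBC * Complexity + UBD * Distance for one candidate representative."""
    complexity = math.log2(num_attributes * num_values)
    mean_distance = sum(distances) / len(distances)
    # math.log2 raises ValueError for 0, so this sketch requires a positive mean.
    distance_term = math.log2(mean_distance)
    return ubc * complexity + ubd * distance_term


def best_candidate(encodings):
    """Return the name of the candidate with the lowest encoding."""
    return min(encodings, key=encodings.get)
```

For instance, with 6 attributes, 1 value each, distances `[1, 2, 1]`, and both bias weights set to 1, the encoding is log2(6) + log2(4/3) = log2(8) = 3. Raising `ubc` penalizes detailed representatives more heavily, while raising `ubd` penalizes information loss.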
16Evaluation of DesCond with Domain Expert
Interviews
- Evaluated with real data in Heat Treating
- User Bias weights in Encoding reflect interests in targeted applications
- Different data sets and number of clusters
- For each data set, score calculated as follows
  - Consider winning candidate for each cluster
    - Based on DesCond Encoding
  - Score: Number of clusters in which candidate is winner
- Example: Dataset of size 25 with 5 clusters
  - If SCR wins for 2 clusters, ACR for 3
  - Score: SCR = 2, ACR = 3
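The scoring rule amounts to counting, per candidate type, the clusters it wins. A minimal sketch follows; the function name is hypothetical, and the order of winners in the example list is invented, since the slide gives only the counts.

```python
from collections import Counter


def score_candidates(winners):
    """Count, per candidate type, the number of clusters in which it is the winner."""
    return Counter(winners)


# One entry per cluster: the candidate with the lowest encoding in that cluster.
scores = score_candidates(["SCR", "ACR", "ACR", "SCR", "ACR"])
```

For the five-cluster example this gives a score of 2 for SCR and 3 for ACR; a candidate that never wins, such as MCR here, reads as 0 from the counter.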
17Evaluation Results
- Details
  - Data Set Size: 400, Number of Clusters: 20
  - Experts provide UBC / UBD values in Encoding
- Observations
  - Overall winner is MCR
  - As weight for complexity increases, SCR wins
  - Designed representatives perform better than randomly selected ones
18Evaluation with Formal User Surveys
- DesCond used to design representatives for a trademarked estimation tool (ref: CHTE, Center for Heat Treating Excellence)
- Formal user surveys conducted in different applications of the system
- Evaluation Process
  - Compare estimation with real data in test set
  - If they match, estimation is accurate
19Evaluation Results
[Figure panels: Parameter Selection Applications; Simulation Tool Applications; Intelligent Tutoring Applications; Decision Support Applications]
- Different winners in different applications
- Results of surveys tally with those of Encoding-based evaluation
- Estimation Accuracy: 90% to 94% (better than earlier versions of tool)
20Related Work
- Image Rating [HH-01]
  - User intervention involved in manual rating
- Semantic Fish Eye Views [JP-04]
  - Display multiple objects in small space, no representatives
- PDA Displays in Levels of Detail [BGMP-01]
  - Do not evaluate different types of representatives
21Conclusions
- Contributions of this work
  - Designing cluster representatives for scientific input conditions in levels of detail
  - Defining a domain-specific distance metric for conditions
  - Proposing an encoding to compare representatives
  - Conducting evaluation using encoding with real data from Heat Treating
  - Assessing use of representatives in applications of a CHTE trademarked estimation tool
- Results
  - Designed Representatives better than random
  - Different designed representatives suit different applications
  - DesCond enhances accuracy of estimation tool