Title: Selected Research Results
1. Selected Research Results and Applications of WSU's
Data Mining Research Lab
- Guozhu Dong
- PhD, Professor
- Data Mining Research Lab
- Wright State University
2. Outline
- Contrast data mining
- Contrast pattern based classifiers
- Contrast pattern mining on sequence data
- Real-time mining/analysis of sensor network data
- Multi-dimensional multi-level data mining in data cubes
- Mining large collections of time series
- Microarray concordance analysis
- Summarizing clusterings of abstracts/articles
- Alternative clustering
- Conversion of undesirable objects
- Data mining for knowledge transfer
- Comparative summary of search results
Focus on the bold topics
3. Contrast data mining: What and Why?
- Contrast: To compare or appraise in respect to differences (Merriam-Webster Dictionary)
- Contrast data mining: The mining of patterns and models contrasting two or more classes, conditions, or datasets.
- Why?
  - "Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more" (Darby Conley, Get Fuzzy, 2001)
  - Useful for understanding, prediction/classification, outlier detection, ...
4. What can be contrasted?
- Objects at different time periods
  - Compare ICDM papers published in 2006-2007 versus those in 2004-2005 to find emerging research directions
- Objects for different spatial locations
  - Find the distinguishing patterns of cars sold in the south, versus those sold in the north
- Objects across different classes
  - Find the key differences between normal colon tissues and cancerous colon tissues
5. How do we contrast two datasets without advanced mining tools?
- Let D1 and D2 be the two datasets.
- We usually find a prototypical case p1 for D1, and a prototypical case p2 for D2. Then we compare p1 against p2.
- We may also compare the distribution of D1 against that of D2.
- Such simplifications often miss the interesting contrast patterns.
6. Alternative names for contrast data mining/patterns
- Contrast data mining is related to change mining, difference mining, discriminator mining, classification rule mining, ...
- Contrast patterns are related to these patterns:
  - Change patterns, class based association rules, contrast sets, concept drift, difference patterns, discriminative patterns, (dis)similarity patterns, emerging patterns, gradient patterns, high confidence patterns, (in)frequent patterns, ...
7. How is contrast data mining used?
- Domain understanding
  - Young children with diabetes have a greater risk of hospital admission, compared to the rest of the population
- Used for building classifiers
  - Many different techniques, to be covered later
  - Also used for weighting and ranking instances
- Used for monitoring
  - Tell me when something unusual (unlike others in this class) arrives
- Understanding can help us do prevention; prediction can help us do treatment. An ounce of prevention is worth a pound of cure!
8. Emerging Patterns
Support = frequency
- Emerging Patterns (EPs) are contrast patterns between two classes of data whose support changes significantly between the two classes. Significant change can be defined by:
  - big support ratio: supp2(X)/supp1(X) > minRatio (similar to RiskRatio; allows patterns with small overall support)
  - big support difference: supp2(X) - supp1(X) > minDiff (as defined by Bay & Pazzani 99)
- If supp2(X)/supp1(X) = infinity, then X is a jumping EP.
  - A jumping EP occurs in some members of one class but never occurs in the other class.
- Here, X is the AND of a set of simple conditions. Extension to OR was also studied.
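The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming each instance is a set of items and a pattern X matches an instance when all of its items are present. The function names, the toy datasets, and the threshold values are made up for the example and are not from the slides.

```python
import math

def support(pattern, dataset):
    """Fraction of instances in dataset that contain every item of pattern."""
    return sum(1 for inst in dataset if pattern <= inst) / len(dataset)

def growth_ratio(pattern, d1, d2):
    """supp2(X)/supp1(X); infinity when X occurs in d2 but never in d1."""
    s1, s2 = support(pattern, d1), support(pattern, d2)
    if s1 == 0:
        return math.inf if s2 > 0 else 0.0
    return s2 / s1

def is_ep_by_ratio(pattern, d1, d2, min_ratio):
    """Big support ratio test."""
    return growth_ratio(pattern, d1, d2) > min_ratio

def is_ep_by_diff(pattern, d1, d2, min_diff):
    """Big support difference test."""
    return support(pattern, d2) - support(pattern, d1) > min_diff

def is_jumping(pattern, d1, d2):
    """Jumping EP: present in d2, absent from d1."""
    return growth_ratio(pattern, d1, d2) == math.inf

# Hypothetical toy data: class 1 and class 2, four instances each.
d1 = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"c"}]
d2 = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"b"}]

x = {"a", "b"}        # supp1 = 1/4, supp2 = 3/4
y = {"a", "b", "c"}   # supp1 = 0,   supp2 = 1/4

print(growth_ratio(x, d1, d2), is_ep_by_ratio(x, d1, d2, 2.0))
print(is_jumping(y, d1, d2))
```

Note that y has low support in class 2 yet an infinite ratio, which is the case the ratio-based definition is meant to admit.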
9. Example EP in microarray data for cancer
Binned data (rows: tissues, columns: genes)

Normal Tissues
g1 g2 g3 g4
L  H  L  H
L  H  L  L
H  L  L  H
L  H  H  L

Cancer Tissues
g1 g2 g3 g4
H  H  L  H
L  H  H  H
L  L  L  H
H  H  H  L

- EP example: X = {g1=L, g2=H, g3=L}; suppN(X) = 50%, suppC(X) = 0
- Use minimality to reduce number of mined EPs
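The supports quoted for X can be checked directly from the two binned tables. The sketch below encodes each tissue as a set of "gene=level" items and also tests minimality, taken here to mean that every proper non-empty subset of X occurs in both classes. The helper names are illustrative, and the brute-force subset check is only suitable for tiny examples.

```python
from itertools import combinations

GENES = ("g1", "g2", "g3", "g4")
NORMAL = ["LHLH", "LHLL", "HLLH", "LHHL"]   # rows of the normal-tissue table
CANCER = ["HHLH", "LHHH", "LLLH", "HHHL"]   # rows of the cancer-tissue table

def to_items(row):
    """Turn a row such as 'LHLH' into {'g1=L', 'g2=H', 'g3=L', 'g4=H'}."""
    return {f"{gene}={level}" for gene, level in zip(GENES, row)}

normal_items = [to_items(r) for r in NORMAL]
cancer_items = [to_items(r) for r in CANCER]

def support(pattern, tissues):
    """Fraction of tissues that contain every item of pattern."""
    return sum(1 for t in tissues if pattern <= t) / len(tissues)

def is_minimal_jumping(pattern, home, other):
    """Jumping EP of `home` whose proper non-empty subsets all occur in `other`."""
    if support(pattern, home) == 0 or support(pattern, other) > 0:
        return False
    items = sorted(pattern)
    for k in range(1, len(items)):
        for subset in combinations(items, k):
            if support(set(subset), other) == 0:
                return False  # a smaller jumping EP exists, so pattern is not minimal
    return True

X = {"g1=L", "g2=H", "g3=L"}
supp_n = support(X, normal_items)
supp_c = support(X, cancer_items)
print(supp_n, supp_c, is_minimal_jumping(X, normal_items, cancer_items))
```

Adding g4=H to X still gives a jumping EP, but not a minimal one, which is why mining only minimal EPs reduces the output.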
10. Top support minimal jumping EPs for colon cancer
These EPs have 95%-100% support in one class but 0% support in the other class. Minimal: each proper subset occurs in both classes.

Colon Cancer EPs (items: support %)
- {1 4- 112 113}: 100
- {1 4- 113 116}: 100
- {1 4- 113 221}: 100
- {1 4- 113 696}: 100
- {1 108- 112 113}: 100
- {1 108- 113 116}: 100
- {4- 108- 112 113}: 100
- {4- 109 113 700}: 100
- {4- 110 112 113}: 100
- {4- 112 113 700}: 100
- {4- 113 117 700}: 100
- {1 6 8- 700}: 97.5

Colon Normal EPs (items: support %)
- {12- 21- 35 40 137 254}: 100
- {12- 35 40 71- 137 254}: 100
- {20- 21- 35 137 254}: 100
- {20- 35 71- 137 254}: 100
- {5- 35 137 177}: 95.5
- {5- 35 137 254}: 95.5
- {5- 35 137 419-}: 95.5
- {5- 137 177 309}: 95.5
- {5- 137 254 309}: 95.5
- {7- 21- 33 35 69}: 95.5
- {7- 21- 33 69 309}: 95.5
- {7- 21- 33 69 1261}: 95.5

EPs from Mao & Dong 05 (gene club border-diff).
There are 1000 items with supp > 80%.
Colon cancer dataset (Alon et al, 1999 (PNAS)): 40 cancer tissues, 22 normal tissues, 2000 genes.
Very few 100% support EPs.
11. Besides uses discussed earlier, another potential use of minimal jumping EPs
- Minimal jumping EPs for normal tissues
  - => Properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues
  - => Restore these => cure colon cancer?
- Minimal jumping EPs for cancer tissues
  - => Bad gene expression groups that occur in some cancer tissues but never occur in normal tissues
  - => Disrupt these => cure colon cancer?
- => Possible targets for drug design?
Li & Wong 02 proposed gene therapy using the EP idea.
A paper using EPs was published in Cancer Cell (cover, 3/02). EPs have been applied in medical applications, e.g. for diagnosing acute lymphoblastic leukemia.
12. EP Mining Algorithms and Studies
- Complexity result (Wang et al 05)
- Border-differential algorithm (Dong & Li 99)
- Gene club border differential (Mao & Dong 05)
- Constraint-based approach (Zhang et al 00)
- Tree-based approach (Bailey et al 02, Fan & Kotagiri 02)
- Projection based algorithm (Bailey et al 03)
- ZBDD based method (Loekito & Bailey 06)
- Equivalence class based (Li et al 07)
Can handle 200 dimensions
13. Contrast pattern based classification: history
- Contrast pattern based classification: methods to build or improve classifiers, using contrast patterns
- CBA (Liu et al 98)
- CAEP (Dong et al 99)
- Instance based method: DeEPs (Li et al 00, 04)
- Jumping EP based (Li et al 00), Information based (Zhang et al 00), Bayesian based (Fan & Kotagiri 03), improving scoring for >3 classes (Bailey et al 03)
- CMAR (Li et al 01)
- Top-ranked EP based PCL (Li & Wong 02)
- CPAR (Yin & Han 03)
- Weighted decision tree (Alhammady & Kotagiri 06)
- Rare class classification (Alhammady & Kotagiri 04)
- Constructing supplementary training instances (Alhammady & Kotagiri 05)
- Noise tolerant classification (Fan & Kotagiri 04)
- One-class classification/detection of outlier cases (Chen & Dong 06)
- ...
- Most follow the aggregating approach of CAEP.
14. EP-based classifiers: rationale
- Consider a typical EP in the Mushroom dataset, {odor = none, stalk-surface-below-ring = smooth, ring-number = one}; its support increases from 0.2% in poisonous to 57.6% in edible (support ratio = 288).
- Strong differentiating power: if a test case T contains this EP, we can predict T as edible with high confidence: 99.6% = 57.6/(57.6+0.2)
- A single EP is usually sharp in telling the class of a small fraction (e.g. 3%) of all instances. Need to aggregate the power of many EPs to make the classification.
- EP based classification methods often outperform state of the art classifiers, including C4.5 and SVM. They are also noise tolerant.
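The Mushroom figures above are easy to reproduce. The short sketch below recomputes the support ratio and the confidence from the two quoted supports. It treats the supports as directly comparable, which amounts to assuming roughly equal class sizes; the function name is illustrative.

```python
def ep_confidence(supp_target, supp_other):
    """Confidence that a case matching the EP belongs to the target class."""
    return supp_target / (supp_target + supp_other)

supp_edible = 57.6      # support (%) in the edible class, from the slide
supp_poisonous = 0.2    # support (%) in the poisonous class, from the slide

ratio = supp_edible / supp_poisonous
conf = ep_confidence(supp_edible, supp_poisonous)
print(round(ratio), f"{conf:.4f}")
```

The confidence comes out just under 0.9966, in line with the slide's 99.6%.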
15. CAEP (Classification by Aggregating Emerging Patterns)
- Given a test case T, obtain T's scores for each class by aggregating the discriminating power of EPs contained in T; assign the class with the maximal score as T's class.
- The discriminating power of EPs is expressed in terms of supports and growth rates. Prefer large supRatio, large support.
- The contribution of one EP X (support weighted confidence):
  strength(X) = sup(X) * supRatio(X) / (supRatio(X) + 1)
- Given a test T and a set E(Ci) of EPs for class Ci, the aggregate score of T for Ci is
  score(T, Ci) = Σ strength(X) (over X of Ci matching T)
- For each class, may use median (or 85%) aggregated value to normalize, to avoid bias towards class with more EPs
CMAR aggregates Chi2 weighted Chi2
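The strength and score formulas can be written down directly. The sketch below assumes the EPs of a class have already been mined and are given as (pattern, sup, supRatio) triples. Treating a jumping EP (infinite ratio) as having strength equal to its support, and normalizing by the median of training scores, are choices made for this illustration; all names and numbers are hypothetical.

```python
import math
from statistics import median

def strength(sup, sup_ratio):
    """sup(X) * supRatio(X) / (supRatio(X) + 1)."""
    if math.isinf(sup_ratio):
        return sup  # ratio/(ratio+1) tends to 1 as the ratio grows
    return sup * sup_ratio / (sup_ratio + 1)

def score(test_items, class_eps):
    """Sum the strengths of the class EPs that are contained in the test case."""
    return sum(strength(sup, ratio)
               for pattern, sup, ratio in class_eps
               if pattern <= test_items)

def normalized_score(raw_score, training_scores):
    """Divide by a per-class base score (here: median score of training cases)."""
    base = median(training_scores)
    return raw_score / base if base > 0 else 0.0

# Hypothetical EPs for one class: (pattern, support, support ratio).
eps = [
    ({"x", "y"}, 0.6, 3.0),
    ({"z"}, 0.2, math.inf),   # a jumping EP
    ({"q"}, 0.9, 5.0),        # not contained in the test case below
]
T = {"x", "y", "z", "w"}

raw = score(T, eps)
print(raw, normalized_score(raw, [0.4, 0.65, 1.3]))
```

Dividing by a per-class base score keeps a class with many EPs from winning merely because it has more patterns to sum over.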
16. How CAEP works? An example
- Given a test case T = {a,d,e}, how to classify T?

Class 1 (D1)
a c d e
a e
b c d e
b

Class 2 (D2)
a b
a b c d
c e
a b d e

- T contains EPs of class 1: {a,e} (50% : 25%) and {d,e} (50% : 25%), so Score(T, class 1) = 0.5 * 0.5/(0.5+0.25) + 0.5 * 0.5/(0.5+0.25) = 0.67
- T contains EPs of class 2: {a,d} (25% : 50%), so Score(T, class 2) = 0.33
- T will be classified as class 1 since Score1 > Score2
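The example can be reproduced from the raw transactions. The sketch below enumerates every non-empty subset of T, keeps those whose support ratio in favour of a class is at least 2 (an assumed threshold, chosen because it yields exactly the EPs named on the slide), and sums their strengths. Brute-force enumeration is only workable for tiny cases like this one.

```python
from itertools import combinations

D1 = [set("acde"), set("ae"), set("bcde"), set("b")]    # class 1
D2 = [set("ab"), set("abcd"), set("ce"), set("abde")]   # class 2
T = set("ade")
MIN_RATIO = 2.0  # assumed threshold, not stated on the slide

def support(pattern, data):
    """Fraction of transactions that contain every item of pattern."""
    return sum(1 for inst in data if pattern <= inst) / len(data)

def caep_score(test, home, other, min_ratio=MIN_RATIO):
    """Return (score, matched EPs) of `test` for the class whose data is `home`."""
    total, matched = 0.0, []
    items = sorted(test)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = set(combo)
            s_home, s_other = support(x, home), support(x, other)
            if s_home == 0:
                continue
            if s_other == 0 or s_home / s_other >= min_ratio:
                # sup * ratio/(ratio+1) rewritten as sup * s_home/(s_home+s_other)
                total += s_home * s_home / (s_home + s_other)
                matched.append(combo)
    return total, matched

score1, eps1 = caep_score(T, D1, D2)
score2, eps2 = caep_score(T, D2, D1)
print(round(score1, 2), eps1)
print(round(score2, 2), eps2)
print("class 1" if score1 > score2 else "class 2")
```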
17. DeEPs (Decision-making by Emerging Patterns)
- An instance based (lazy) learning method, like k-NN, but does not use the normal distance measure.
- For a test instance T, DeEPs
  - First projects all training instances to contain only items in T
  - Discovers EPs from the projected data
  - Uses these EPs to get the training data that match some discovered EPs
  - Finally, uses the proportional size of matching data in a class C as T's score for C
- Advantage: similar EPs are not allowed to give duplicate votes!
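The four steps can be sketched as follows. This is a simplified illustration, not the published DeEPs algorithm: it finds EPs by brute-force enumeration with an assumed support-ratio threshold of 2, and it reuses two small made-up transaction sets. The function and variable names are hypothetical.

```python
from itertools import combinations

def support(pattern, data):
    """Fraction of instances that contain every item of pattern."""
    return sum(1 for inst in data if pattern <= inst) / len(data)

def deeps_scores(test, classes, min_ratio=2.0):
    """classes maps a label to a list of item sets; returns label -> score."""
    # Step 1: project every training instance onto the items of the test case.
    projected = {label: [inst & test for inst in data]
                 for label, data in classes.items()}
    items = sorted(test)
    candidates = [set(c) for k in range(1, len(items) + 1)
                  for c in combinations(items, k)]
    scores = {}
    for label, home in projected.items():
        others = [inst for other, data in projected.items()
                  if other != label for inst in data]
        # Step 2: discover the EPs of this class in the projected data.
        eps = []
        for x in candidates:
            s_home, s_other = support(x, home), support(x, others)
            if s_home > 0 and (s_other == 0 or s_home / s_other >= min_ratio):
                eps.append(x)
        # Steps 3 and 4: fraction of the class matching at least one EP.
        matching = sum(1 for inst in home if any(x <= inst for x in eps))
        scores[label] = matching / len(home)
    return scores

classes = {
    "class1": [set("acde"), set("ae"), set("bcde"), set("b")],
    "class2": [set("ab"), set("abcd"), set("ce"), set("abde")],
}
scores = deeps_scores(set("ade"), classes)
print(scores)
```

Because each training instance is counted at most once however many EPs it matches, overlapping EPs cannot inflate a class score.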
18. Why EP-based classifiers are good
- Use the discriminating power of low support EPs (with high supRatio), in addition to the high support ones
- Use multi-feature conditions, not just single-feature conditions
- Select from larger pools of discriminative conditions
  - Compare: search space of patterns for decision trees is limited by early greedy choices.
- Aggregate/combine the discriminating power of a diversified committee of experts (EPs)
- Decision of such classifiers is highly explainable
19. Also Studied: Contrast Pattern Mining for
- Sequence family A vs sequence family B
- Graph collection A vs graph collection B
- Build contrast pattern based clustering quality index
- Constructing synthetic training data for classes with few training instances
- ...
- More than 6 PhD dissertations
- About 50 research papers
- A tutorial given at IEEE ICDM 2007
20. Multi-dimensional multi-level data mining in data cubes
- Data cube is used for discovering patterns captured in consolidated historical data for a company/organization
  - rules, anomalies, unusual factor combinations
- Data cube is focused on modeling and analysis of data for decision makers, not daily operations.
- Data organized around major subjects or factors, such as customer, product, time, sales.
- Cube contains huge number of MDML (multi-dimensional multi-level) summaries for segments or sectors at different levels of details
- Basic OLAP operations: drill down, roll up, slice and dice, pivot
21. Data Cubes: Base Table and Hierarchies
- Base table stores sales volume (measure), a function of product, time, location (dimensions)
Hierarchical summarization paths (coarse to fine):
- Product: Industry > Category > Product
- Location: Region > Country > City > Office
- Time: Year > Quarter > Month, Week > Day
[Figure labels: "all" (as top of each dimension); a base cell]
22. Data Cubes: Derived Cells
Measures: sum, count, avg, max, min, std, ...
Example derived cell: (TV, *, Mexico)
Derived cells, different levels of details
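A derived cell summarizes all base cells it covers. The sketch below assumes a tiny base table of (product, time, location, sales) rows with invented values, and uses `*` as the "any value" marker for a dimension, which is an assumption of this illustration. The function name `derived_cell` is hypothetical.

```python
from statistics import mean

# Hypothetical base table: (product, time, location, sales volume).
BASE = [
    ("TV", "Q1", "Mexico", 10),
    ("TV", "Q2", "Mexico", 14),
    ("TV", "Q1", "Canada", 7),
    ("PC", "Q1", "Mexico", 5),
]

def derived_cell(base, cell):
    """Aggregate the measures of all base rows covered by cell ('*' = any)."""
    values = [row[-1] for row in base
              if all(c == "*" or c == v for c, v in zip(cell, row[:-1]))]
    if not values:
        return None
    return {
        "sum": sum(values),
        "count": len(values),
        "avg": mean(values),
        "max": max(values),
        "min": min(values),
    }

tv_mexico = derived_cell(BASE, ("TV", "*", "Mexico"))
everything = derived_cell(BASE, ("*", "*", "*"))
print(tv_mexico)
print(everything)
```

Replacing more dimensions by `*` moves to a coarser level of detail; the all-`*` cell is the grand total.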
23. Gradient mining in data cubes
- Find syntactically similar cells with significantly different measure values
- E.g.
  - (house, California, May, 2008), total-sale = 100M
  - vs (house, Iowa, May, 2008), total-sale = 200M
  - This is made up to show the point
Other people studied iceberg cubes, cells significantly different from neighbors, ...
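One simple reading of "syntactically similar" is that two cells differ in exactly one dimension; the sketch below uses that assumption and reports pairs whose measures differ by at least a given ratio. The third cell (Ohio) and the threshold are invented for the illustration, just as the slide's own figures are made up.

```python
from itertools import combinations

def gradient_pairs(cells, min_ratio):
    """cells maps a dimension tuple to its measure; returns qualifying pairs."""
    result = []
    for (c1, m1), (c2, m2) in combinations(sorted(cells.items()), 2):
        differing = sum(1 for a, b in zip(c1, c2) if a != b)
        if differing != 1:
            continue  # only compare cells that differ in a single dimension
        low, high = sorted((m1, m2))
        if low > 0 and high / low >= min_ratio:
            result.append((c1, c2, high / low))
    return result

# Total sales in millions; all values are made up.
cells = {
    ("house", "California", "May", 2008): 100,
    ("house", "Iowa", "May", 2008): 200,
    ("house", "Ohio", "May", 2008): 110,
}
pairs = gradient_pairs(cells, min_ratio=2.0)
print(pairs)
```

A fuller treatment would also compare cells with their ancestors and descendants in the dimension hierarchies, which this sketch leaves out.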
24. Multi-Dimensional Trends Analysis of Sets of Time-Series in Data Cubes
- Consider applications having many time series
  - ECG curves, stocks, power grids, sensor networks, internet, gene expressions for toxicology study, ...
- Need MDML trends analysis
  - Mining/monitoring unusual patterns/events, in MDML manner
  - E.g. find good sets of stocks with desired total risk/reward ratios
- Regression cube for time series
  - Store regression base cube
  - Support MDML OLAP of regressions
- Results also useful for MDML data stream monitoring
25. Example: Aggregating Set of Time Series
- Two component cells
- Aggregated cell
Deriving regression of aggregated cell from
regression of component cells
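The slide does not give the derivation, but one setting where it is simple can be sketched. Assume the aggregated cell is the pointwise sum of the component series over the same time points and each series is summarized by a least-squares line. Because least-squares fitting is linear in the observed values, the intercept and slope of the aggregated series equal the sums of the component intercepts and slopes, so the raw series need not be revisited. The data and names below are invented.

```python
def fit_line(ts, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope

ts = [0, 1, 2, 3, 4]                      # shared time points
series_a = [1.0, 2.1, 2.9, 4.2, 5.0]      # component cell 1 (made up)
series_b = [3.0, 2.5, 2.2, 1.4, 1.1]      # component cell 2 (made up)
aggregated = [a + b for a, b in zip(series_a, series_b)]

ia, sa = fit_line(ts, series_a)
ib, sb = fit_line(ts, series_b)
i_direct, s_direct = fit_line(ts, aggregated)   # fit on the raw aggregated data
i_derived, s_derived = ia + ib, sa + sb         # derived from stored regressions

print(i_direct, s_direct)
print(i_derived, s_derived)
```

The two printed lines agree up to floating-point rounding, which is what lets a cube store only regression parameters in its base cells.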
26. In-Network Detection of Shapes of Region-Based Events in Sensor Networks
[Figure: a field of sensor nodes, several of them performing event sensing]
Each sensor can sense events, and talk with neighbors
27. Research Problems Studied
- Detection of Region-Based Events: given a sensor network, when a region-based event occurs, report the spatial geometric information, which may include
  - the boundaries and the shape of the region
  - positions of important points
  - important metrics: length, area, density
- Tracking of Region-Based Events: after initial detection of a region-based event, determine its spatial dynamic parameters (moving direction, speed, expansion rate of area, etc).
- Computation is done in the sensor network, which is organized into an R-tree.
28. Multiple platforms/labs dataset concordance/consistency evaluation
- Microarrays (supplied by different manufacturers) are used by different labs to measure gene expressions in tissues.
- Without knowing the concordance between platform/lab conditions, it is hard to transfer knowledge (patterns/classifiers) from one lab to another
- We provide measures and techniques to address this problem, based on discriminating gene/classifier transferability
29. Summarizing clusterings of documents
- We often need to process large collections of documents (abstracts, articles, Google search results, ...)
- We need methods to help us quickly get a sense of the main themes of the documents
- We gave methods to find summary word sets (cluster description sets) to describe clusterings of documents
- Words in a summary set for a cluster should be typical in the cluster, and be rare in other clusters
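The "typical inside, rare outside" criterion can be illustrated with a simple score. The sketch below is not the lab's method: it scores each word by its document frequency inside the cluster minus its highest document frequency in any other cluster, then keeps the top k words. The scoring rule, the function name, and the toy documents are assumptions made for the example.

```python
def summary_words(clusters, k=2):
    """clusters maps a cluster name to a list of documents (sets of words)."""
    def doc_freq(word, docs):
        return sum(1 for d in docs if word in d) / len(docs)

    summary = {}
    for name, docs in clusters.items():
        vocabulary = set().union(*docs)
        scored = []
        for word in vocabulary:
            inside = doc_freq(word, docs)
            outside = max((doc_freq(word, other_docs)
                           for other, other_docs in clusters.items()
                           if other != name), default=0.0)
            scored.append((inside - outside, word))
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        summary[name] = [word for _, word in scored[:k]]
    return summary

clusters = {
    "mining": [{"data", "pattern", "mining"},
               {"pattern", "contrast", "data"},
               {"mining", "pattern", "cube"}],
    "sensor": [{"sensor", "network", "data"},
               {"sensor", "event", "network"},
               {"sensor", "data", "region"}],
}
summary = summary_words(clusters)
print(summary)
```

The word "data" is frequent in both clusters and therefore describes neither, which is the behaviour the last bullet asks for.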
30. Alternative Clustering
- Clustering is usually performed on poorly understood datasets
- Multiple clusterings (ways to group the data) may exist
- Need methods to discover alternative clusterings
- We gave algorithms to solve this problem, and introduced a new similarity measure between clusterings
31. Undesirable object converter mining
- We have a class of desirable objects and a class of undesirable objects.
- The goal is to mine small sets of attribute changes, which when applied to undesirable objects, may change those objects' class from undesirable to desirable.
- We considered two types of converter sets: personalized, and universal
- We gave algorithms to mine them
32. Data mining for knowledge transfer
- We have two application domains: a well understood one and a less understood one.
- The goal is to mine knowledge that can be transferred from the well understood domain to the less understood domain, to solve problems in the less understood domain
33. Comparative summary of search results
- We often perform multiple searches on the web or on a document collection.
- There is an information overload when we process the search results.
- We developed tools to compare and summarize the search results to reduce the information overload.
- Compare two searches: examples
  - Same key words searched at two time points
  - Same key words searched over two locations, etc
34. Outline of Some Recent Works, Review
- Contrast data mining
- Contrast pattern based classifiers
- Contrast pattern mining on sequence data
- Real-time mining/analysis of sensor network data
- Multi-dimensional multi-level data mining in data cubes
- Mining large collections of time series
- Microarray concordance analysis using contrast patterns
- Summarizing clusterings of abstracts/articles
- Alternative clustering
- Conversion of undesirable objects
- Data mining for knowledge transfer
- Comparative summary of search results
35. Thank you
- List of papers available at http://www.cs.wright.edu/gdong/
- Email: guozhu.dong_at_wright.edu
- Collaboration opportunities to work on your problems are welcome