Selected Research Results - PowerPoint PPT Presentation

About This Presentation

Title:

Selected Research Results

Description:

Selected Research Results & Applications of WSU's Data Mining Research Lab. Guozhu Dong, PhD, Professor, Data Mining Research Lab, Wright State University

Slides: 36

Provided by: wrightEdu

Transcript and Presenter's Notes

Title: Selected Research Results

1
Selected Research Results & Applications of WSU's
Data Mining Research Lab

Guozhu Dong
PhD, Professor
Data Mining Research Lab
Wright State University

2
Outline

Contrast data mining
Contrast pattern based classifiers
Contrast pattern mining on sequence data
Real-time mining/analysis of sensor network data
Multi-dimensional multi-level data mining in data
cubes
Mining large collections of time series
Microarray concordance analysis
Summarizing clusterings of abstracts/articles
Alternative clustering
Conversion of undesirable objects
Data mining for knowledge transfer
Comparative summary of search results

Focus on the bold topics
3
Contrast data mining - What & Why?

Contrast - "To compare or appraise in respect to
differences" (Merriam Webster Dictionary)
Contrast data mining - The mining of patterns and
models contrasting two or more classes,
conditions, or datasets.
Why?
"Sometimes it's good to contrast what you like
with something else. It makes you appreciate it
even more."
Darby Conley, Get Fuzzy, 2001
Useful for understanding, prediction/classification,
outlier detection, ...

4
What can be contrasted ?

Objects at different time periods
Compare ICDM papers published in 2006-2007
versus those in 2004-2005 to find emerging
research directions
Objects for different spatial locations
Find the distinguishing patterns of cars sold
in the south, versus those sold in the north
Objects across different classes
Find the key differences between normal colon
tissues and cancerous colon tissues

5
How do we contrast two datasets, without advanced
mining tools?

Let D1 and D2 be the two datasets.
We usually find a prototypical case p1 for D1,
and a prototypical case p2 for D2. Then we
compare p1 against p2.
We may also compare the distribution of D1
against that of D2.
Such simplifications often miss the interesting
contrast patterns.

6
Alternative names for contrast data
mining/patterns

Contrast data mining is related to change mining,
difference mining, discriminator mining,
classification rule mining, ...
Contrast patterns are related to these patterns:
change patterns, class based association
rules, contrast sets, concept drift, difference
patterns, discriminative patterns,
(dis)similarity patterns, emerging patterns,
gradient patterns, high confidence patterns,
(in)frequent patterns, ...

7
How is contrast data mining used ?

Domain understanding
Young children with diabetes have a greater
risk of hospital admission, compared to the rest
of the population
Used for building classifiers
Many different techniques - to be covered later
Also used for weighting and ranking instances
Used for monitoring
Tell me when something unusual (unlike others
in this class) arrives
Understanding can help us do prevention,
prediction can help us do treatment. An ounce of
prevention is worth a pound of cure!

8
Emerging Patterns
Support (frequency)

Emerging Patterns (EPs) are contrast patterns
between two classes of data whose support changes
significantly between the two classes.
Significant change can be defined by:
big support ratio:
supp2(X)/supp1(X) > minRatio
(similar to RiskRatio; allows patterns with
small overall support)
big support difference:
|supp2(X) - supp1(X)| > minDiff
(as defined by Bay & Pazzani 99)
If supp2(X)/supp1(X) = infinity, then X is a
jumping EP.
A jumping EP occurs in some members of one class
but never occurs in the other class.
Here, X is the AND of a set of simple conditions.
Extension to OR was also studied.
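
The definitions on this slide can be sketched in a few lines of Python. This is only an illustration, not the lab's implementation: the function names and the toy datasets d1 and d2 are made up here, and a pattern X is represented as a set of items that must all be present (the AND of simple conditions).

```python
from math import inf

def support(pattern, dataset):
    """Fraction of transactions in dataset containing every item of pattern."""
    if not dataset:
        return 0.0
    pattern = frozenset(pattern)
    return sum(pattern <= set(t) for t in dataset) / len(dataset)

def support_ratio(pattern, d1, d2):
    """supp2(X)/supp1(X); infinity if X occurs in d2 but never in d1."""
    s1, s2 = support(pattern, d1), support(pattern, d2)
    if s1 == 0:
        return inf if s2 > 0 else 0.0
    return s2 / s1

def support_difference(pattern, d1, d2):
    """|supp2(X) - supp1(X)|."""
    return abs(support(pattern, d2) - support(pattern, d1))

def is_emerging(pattern, d1, d2, min_ratio):
    """X is an EP from d1 to d2 if its support ratio exceeds min_ratio."""
    return support_ratio(pattern, d1, d2) > min_ratio

def is_jumping(pattern, d1, d2):
    """X is a jumping EP if it occurs in d2 and never in d1."""
    return support_ratio(pattern, d1, d2) == inf

# Hypothetical toy data, one set of items per transaction.
d1 = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"c"}]
d2 = [{"a", "b"}, {"a", "b", "c"}, {"b", "d"}, {"a", "b", "d"}]

print(support_ratio({"a", "b"}, d1, d2))   # ratio of 75% over 25%
print(is_jumping({"b", "d"}, d1, d2))      # {b, d} never occurs in d1
```

The ratio test keeps low-support patterns as long as they are concentrated in one class, which is why it behaves differently from the difference test.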
9
Example EP in microarray data for cancer

EP example: X = {g1=L, g2=H, g3=L}; suppN(X) = 50%,
suppC(X) = 0
Use minimality to reduce number of mined EPs

Binned data (rows: tissues; columns: genes)

Normal Tissues
g1 g2 g3 g4
L H L H
L H L L
H L L H
L H H L

Cancer Tissues
g1 g2 g3 g4
H H L H
L H H H
L L L H
H H H L
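
As a check on the example, the sketch below recomputes the supports of X on the two binned tables and lists the minimal jumping EPs for the normal class by brute force. Brute-force enumeration is an assumption made for illustration only; it is feasible for 4 genes but not for real microarray data, where the algorithms listed later in the talk are needed. All function names are hypothetical.

```python
from itertools import combinations

GENES = ("g1", "g2", "g3", "g4")

def to_itemsets(rows):
    """Turn 'L H L H' style rows into sets of (gene, level) items."""
    return [frozenset(zip(GENES, row.split())) for row in rows]

normal = to_itemsets(["L H L H", "L H L L", "H L L H", "L H H L"])
cancer = to_itemsets(["H H L H", "L H H H", "L L L H", "H H H L"])

def support(pattern, tissues):
    """Fraction of tissues that contain every item of pattern."""
    return sum(pattern <= t for t in tissues) / len(tissues)

def minimal_jumping_eps(target, other):
    """Patterns that occur in target, never in other, and have no
    proper subset with the same property (brute force)."""
    jumping = set()
    for tissue in target:
        for k in range(1, len(tissue) + 1):
            for combo in combinations(sorted(tissue), k):
                pattern = frozenset(combo)
                if support(pattern, other) == 0:
                    jumping.add(pattern)
    return {p for p in jumping if not any(q < p for q in jumping)}

X = frozenset({("g1", "L"), ("g2", "H"), ("g3", "L")})
print(support(X, normal), support(X, cancer))
for ep in sorted(minimal_jumping_eps(normal, cancer), key=sorted):
    print(sorted(ep), support(ep, normal))
```

Keeping only minimal patterns is what shrinks the output: every superset of a jumping EP that still occurs in the normal tissues is also a jumping EP, but adds no new information.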
10
Top support minimal jumping EPs for colon cancer
These EPs have 95%--100% support in one class but
0% support in the other class. Minimal: each
proper subset occurs in both classes.

Colon Cancer EPs (items, then support)
{1 4- 112 113} 100%
{1 4- 113 116} 100%
{1 4- 113 221} 100%
{1 4- 113 696} 100%
{1 108- 112 113} 100%
{1 108- 113 116} 100%
{4- 108- 112 113} 100%
{4- 109 113 700} 100%
{4- 110 112 113} 100%
{4- 112 113 700} 100%
{4- 113 117 700} 100%
{1 6 8- 700} 97.5%

Colon Normal EPs (items, then support)
{12- 21- 35 40 137 254} 100%
{12- 35 40 71- 137 254} 100%
{20- 21- 35 137 254} 100%
{20- 35 71- 137 254} 100%
{5- 35 137 177} 95.5%
{5- 35 137 254} 95.5%
{5- 35 137 419-} 95.5%
{5- 137 177 309} 95.5%
{5- 137 254 309} 95.5%
{7- 21- 33 35 69} 95.5%
{7- 21- 33 69 309} 95.5%
{7- 21- 33 69 1261} 95.5%

EPs from Mao & Dong 05 (gene club border-diff).
There are 1000 items with supp > 80%.
Colon cancer dataset (Alon et al, 1999 (PNAS)):
40 cancer tissues, 22 normal tissues, 2000 genes.
Very few 100% support EPs.
11
Besides uses discussed earlier, another potential
use of minimal jumping EPs

Minimal jumping EPs for normal tissues
-> Properly expressed gene groups important for
normal cell functioning, but destroyed in all
colon cancer tissues
-> Restore these -> cure colon cancer?
Minimal jumping EPs for cancer tissues
-> Bad gene expression groups that occur in
some cancer tissues but never occur in normal
tissues
-> Disrupt these -> cure colon cancer?
-> Possible targets for drug design?

Li & Wong 02 proposed gene therapy using EP idea.
Paper using EP published in Cancer Cell (cover,
3/02). EPs have been applied in medical
applications for diagnosing Acute Lymphoblastic
Leukemia etc.
12
EP Mining Algorithms and Studies

Complexity result (Wang et al 05)
Border-differential algorithm (Dong & Li 99)
Gene club border differential (Mao & Dong 05)
Constraint-based approach (Zhang et al 00)
Tree-based approach (Bailey et al 02,
Fan & Kotagiri 02)
Projection based algorithm (Bailey et al 03)
ZBDD based method (Loekito & Bailey 06)
Equivalence class based (Li et al 07)

Can handle 200 dimensions
13
Contrast pattern based classification -- history

Contrast pattern based classification: methods to
build or improve classifiers, using contrast
patterns
CBA (Liu et al 98)
CAEP (Dong et al 99)
Instance based method: DeEPs (Li et al 00, 04)
Jumping EP based (Li et al 00), Information based
(Zhang et al 00), Bayesian based (Fan & Kotagiri
03), improving scoring for >3 classes (Bailey et
al 03)
CMAR (Li et al 01)
Top-ranked EP based PCL (Li & Wong 02)
CPAR (Yin & Han 03)
Weighted decision tree (Alhammady & Kotagiri 06)
Rare class classification (Alhammady & Kotagiri 04)
Constructing supplementary training instances
(Alhammady & Kotagiri 05)
Noise tolerant classification (Fan & Kotagiri 04)
One-class classification/detection of outlier
cases (Chen & Dong 06)
Most follow the aggregating approach of CAEP.

14
EP-based classifiers: rationale

Consider a typical EP in the Mushroom dataset,
{odor = none, stalk-surface-below-ring = smooth,
ring-number = one}: its support increases from
0.2% in poisonous to 57.6% in edible
(support ratio = 288).
Strong differentiating power: if a test case T
contains this EP, we can predict T as edible with
high confidence: 99.6% = 57.6/(57.6+0.2).
A single EP is usually sharp in telling the class
of a small fraction (e.g. 3%) of all instances.
Need to aggregate the power of many EPs to make
the classification.
EP based classification methods often outperform
state of the art classifiers, including C4.5 and
SVM. They are also noise tolerant.

15
CAEP (Classification by Aggregating Emerging
Patterns)

Given a test case T, obtain T's scores for each
class, by aggregating the discriminating power of
EPs contained in T; assign the class with the
maximal score as T's class.
The discriminating power of EPs is expressed in
terms of supports and growth rates. Prefer large
supRatio, large support.

The contribution of one EP X (support weighted
confidence):

strength(X) = sup(X) * supRatio(X) /
(supRatio(X) + 1)

(Compare: CMAR aggregates Chi2 weighted Chi2.)

Given a test T and a set E(Ci) of EPs for class
Ci, the aggregate score of T for Ci is

score(T, Ci) = SUM strength(X) (over X of Ci
matching T)

For each class, may use median (or 85%)
aggregated value to normalize, to avoid bias
towards class with more EPs
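
A minimal sketch of this scoring scheme follows, assuming the EPs of each class have already been mined and are given with their supports in the target class and in the other class. The function names, the tuple layout and the example EPs are hypothetical, not taken from the CAEP paper. Note that supRatio/(supRatio+1) equals sup_target/(sup_target+sup_other), which avoids dividing by zero for jumping EPs.

```python
def strength(sup_target, sup_other):
    """sup(X) * supRatio(X) / (supRatio(X) + 1), rewritten as
    sup_target * sup_target / (sup_target + sup_other)."""
    if sup_target == 0:
        return 0.0
    return sup_target * sup_target / (sup_target + sup_other)

def caep_score(test_case, eps):
    """Sum the strengths of the EPs of one class that are contained in test_case.
    eps is a list of (pattern, sup_target, sup_other) tuples."""
    return sum(strength(s_t, s_o) for pattern, s_t, s_o in eps
               if pattern <= test_case)

def caep_classify(test_case, eps_by_class, base_scores=None):
    """Return (best class, scores). base_scores optionally holds one
    normalizing value per class, e.g. a median training score."""
    scores = {c: caep_score(test_case, eps) for c, eps in eps_by_class.items()}
    if base_scores:
        scores = {c: s / base_scores[c] for c, s in scores.items()}
    return max(scores, key=scores.get), scores

# Hypothetical EPs: (pattern, support in own class, support in other class).
eps_by_class = {
    "C1": [(frozenset({"x", "y"}), 0.4, 0.0), (frozenset({"z"}), 0.6, 0.2)],
    "C2": [(frozenset({"x"}), 0.5, 0.25)],
}
label, scores = caep_classify({"x", "y", "w"}, eps_by_class)
print(label, scores)
```

Passing base_scores divides each class score by a per-class constant, which is one simple way to realize the normalization mentioned on the slide.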

16
How CAEP works? An example

Given a test case T = {a,d,e}, how to classify T?

Class 1 (D1)
a c d e
a e
b c d e
b

Class 2 (D2)
a b
a b c d
c e
a b d e

T contains EPs of class 1: {a,e} (50% : 25%) and
{d,e} (50% : 25%), so
Score(T, class 1) =
0.5 * 0.5/(0.5+0.25) + 0.5 * 0.5/(0.5+0.25)
= 0.67

T contains EPs of class 2: {a,d} (25% : 50%), so
Score(T, class 2) = 0.33
T will be classified as class 1 since
Score1 > Score2
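
The numbers in this example can be reproduced with the short script below. It enumerates the subsets of T by brute force and keeps those whose support ratio exceeds a threshold; the value MIN_RATIO = 1.5 is an assumption chosen here so that the EPs found are exactly the three named on the slide, since the slide does not state the threshold used.

```python
from itertools import combinations

D1 = [set("acde"), set("ae"), set("bcde"), set("b")]
D2 = [set("ab"), set("abcd"), set("ce"), set("abde")]
T = set("ade")
MIN_RATIO = 1.5  # assumed threshold, not given on the slide

def support(pattern, data):
    """Fraction of rows in data that contain every item of pattern."""
    return sum(pattern <= row for row in data) / len(data)

def eps_in_test_case(target, other, test_case, min_ratio=MIN_RATIO):
    """EPs of the target class that are subsets of test_case."""
    found = []
    items = sorted(test_case)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = set(combo)
            s_t, s_o = support(x, target), support(x, other)
            if s_t > 0 and (s_o == 0 or s_t / s_o > min_ratio):
                found.append((x, s_t, s_o))
    return found

def score(target, other, test_case):
    """Sum of sup * sup / (sup + sup_other) over the matching EPs."""
    return sum(s_t * s_t / (s_t + s_o)
               for _, s_t, s_o in eps_in_test_case(target, other, test_case))

score1 = score(D1, D2, T)
score2 = score(D2, D1, T)
predicted = "class 1" if score1 > score2 else "class 2"
print(round(score1, 2), round(score2, 2), predicted)
```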
17
DeEPs (Decision-making by Emerging Patterns)

An instance based (lazy) learning method, like
k-NN, but does not use the normal distance
measure.
For a test instance T, DeEPs:
First, projects all training instances to contain
only items in T
Discovers EPs from the projected data
Uses these EPs to get the training data that match
some discovered EPs
Finally, uses the proportional size of matching
data in a class C as T's score for C
Advantage: disallows similar EPs from giving
duplicate votes!
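
The four steps can be sketched as follows. This is a heavily simplified illustration, not the published DeEPs algorithm: EPs are found by brute-force enumeration with an assumed ratio threshold, and all function names and the toy data are hypothetical.

```python
from itertools import combinations

def project(data, test_case):
    """Step 1: keep only the items of each training instance that occur in T."""
    return [row & test_case for row in data]

def support(pattern, data):
    """Fraction of rows in data that contain every item of pattern."""
    return sum(pattern <= row for row in data) / len(data)

def discover_eps(target, other, min_ratio):
    """Step 2: brute-force EPs of target over the projected data."""
    items = sorted(set().union(*target))
    eps = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            s_t, s_o = support(x, target), support(x, other)
            if s_t > 0 and (s_o == 0 or s_t / s_o > min_ratio):
                eps.append(x)
    return eps

def deeps_scores(classes, test_case, min_ratio=1.5):
    """Steps 3 and 4: per class, the fraction of its instances that match
    at least one discovered EP. Assumes at least two classes."""
    projected = {c: project(d, test_case) for c, d in classes.items()}
    scores = {}
    for c, target in projected.items():
        other = [row for o, d in projected.items() if o != c for row in d]
        eps = discover_eps(target, other, min_ratio)
        matched = sum(any(x <= row for x in eps) for row in target)
        scores[c] = matched / len(target)
    return scores

# Hypothetical toy data.
classes = {
    "class 1": [set("acde"), set("ae"), set("bcde"), set("b")],
    "class 2": [set("ab"), set("abcd"), set("ce"), set("abde")],
}
scores = deeps_scores(classes, set("ade"))
print(scores, max(scores, key=scores.get))
```

Because each training instance is counted at most once however many EPs it matches, two near-duplicate EPs do not double the score, which is the advantage noted on the slide.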

18
Why EP-based classifiers are good

Use the discriminating power of low support EPs
(with high supRatio), in addition to the high
support ones
Use multi-feature conditions, not just
single-feature conditions
Select from larger pools of discriminative
conditions
Compare: the search space of patterns for decision
trees is limited by early greedy choices.
Aggregate/combine the discriminating power of a
diversified committee of experts (EPs)
Decision of such classifiers is highly explainable

19
Also Studied Contrast Pattern Mining for

Sequence family A vs sequence family B
Graph collection A vs graph collection B
Build contrast pattern based clustering quality
index
Constructing synthetic training data for classes
with few training instances
More than 6 PhD dissertations
About 50 research papers
A tutorial given at IEEE ICDM 2007

20
Multi-dimensional multi-level data mining in data
cubes

Data cube is used for discovering patterns
captured in consolidated historical data for a
company/organization:
rules, anomalies, unusual factor combinations
Data cube is focused on modeling & analysis of
data for decision makers, not daily operations.
Data organized around major subjects or factors,
such as customer, product, time, sales.
Cube contains huge number of MDML
(multi-dimensional multi-level) summaries for
segments or sectors at different levels of
details
Basic OLAP operations: drill down, roll up, slice
and dice, pivot

21
Data Cubes: Base Table & Hierarchies

Base table stores sales volume (measure), a
function of product, time, location (dimensions)

Hierarchical summarization paths
(all is the top of each dimension):
Product: all > Industry > Category > Product
Location: all > Region > Country > City > Office
Time: all > Year > Quarter > Month / Week > Day
[Figure: the cube, with a base cell marked]
22
Data Cubes: Derived Cells
Measures: sum, count, avg, max, min, std, ...
Example derived cell: (TV,*,Mexico)
Derived cells, different levels of details
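
To make the idea of derived cells concrete, the sketch below builds every cell obtainable from a small base table by replacing any subset of dimension values with "*". The base table rows and the function name build_cube are hypothetical, and the sketch ignores the concept hierarchies (it only rolls a dimension all the way up to "all").

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical base table: (product, time, location, sales volume).
base_table = [
    ("TV", "Q1", "Mexico", 10),
    ("TV", "Q2", "Mexico", 15),
    ("TV", "Q1", "Canada", 7),
    ("PC", "Q1", "Mexico", 20),
]

def build_cube(rows):
    """Map each base or derived cell to its aggregated measures."""
    values = defaultdict(list)
    for *dims, measure in rows:
        # Each base row contributes to every cell that generalizes it.
        for mask in cartesian((False, True), repeat=len(dims)):
            cell = tuple("*" if star else v for v, star in zip(dims, mask))
            values[cell].append(measure)
    return {
        cell: {"sum": sum(v), "count": len(v), "avg": sum(v) / len(v),
               "max": max(v), "min": min(v)}
        for cell, v in values.items()
    }

cube = build_cube(base_table)
print(cube[("TV", "*", "Mexico")])   # TV sales in Mexico over all quarters
print(cube[("*", "*", "*")]["sum"])  # grand total
```

With d dimensions each base row feeds 2^d cells, which is why real cubes hold a huge number of summaries and are usually materialized only partially.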
23
Gradient mining in data cubes

Find syntactically similar cells with
significantly different measure values
E.g.
(house, California, May, 2008), total-sale = 100M
vs (house, Iowa, May, 2008), total-sale = 200M
(This is made up to show the point.)

Other people studied iceberg cubes, cells
significantly different from neighbors, ...
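
A naive sketch of the idea: treat two cells as syntactically similar when they differ in exactly one dimension value (an assumption made here; the slide does not define similarity), and report pairs whose measures differ by at least a given ratio. The first two cells are the made-up example from the slide; the other two cells, the threshold and the function names are also hypothetical.

```python
from itertools import combinations

# Cell -> total sale in millions.
cells = {
    ("house", "California", "May", 2008): 100,
    ("house", "Iowa", "May", 2008): 200,
    ("house", "Ohio", "May", 2008): 150,
    ("condo", "Iowa", "May", 2008): 190,
}

def differs_in_one_dimension(c1, c2):
    """True if the two cells agree on all but one dimension value."""
    return sum(a != b for a, b in zip(c1, c2)) == 1

def gradient_pairs(cells, min_ratio):
    """Pairs of similar cells whose measures differ by a factor >= min_ratio."""
    pairs = []
    for (c1, m1), (c2, m2) in combinations(cells.items(), 2):
        if differs_in_one_dimension(c1, c2):
            low, high = sorted((m1, m2))
            if low > 0 and high / low >= min_ratio:
                pairs.append((c1, c2, high / low))
    return pairs

for c1, c2, ratio in gradient_pairs(cells, min_ratio=1.8):
    print(c1, "vs", c2, "ratio", ratio)
```

Comparing all pairs is quadratic in the number of cells, so this only illustrates what is being searched for, not how to search a real cube efficiently.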
24
Multi-Dimensional Trends Analysis of Sets of
Time-Series in Data Cubes

Consider applications having many time series:
ECG curves, stocks, power grids, sensor networks,
internet, gene expressions for toxicology study,
...
Need MDML trends analysis
Mining/monitoring unusual patterns/events, in
MDML manner
E.g. find good sets of stocks with desired total
risk/reward ratios
Regression cube for time series
Store regression base cube
Support MDML OLAP of regressions
Results also useful for MDML data stream
monitoring

25
Example: Aggregating a Set of Time Series

Two component cells
Aggregated cell

Deriving regression of aggregated cell from
regression of component cells
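
One case where this derivation is straightforward: if the aggregated cell is the pointwise sum of the component series over the same time points, then, because least-squares fitting is linear in the observed values, the intercept and slope of the aggregated line are the sums of the component intercepts and slopes. The sketch below checks this numerically; the series values and the function name fit_line are hypothetical, and other kinds of aggregation (for example along the time dimension) are not covered.

```python
def fit_line(ys):
    """Least-squares line y = intercept + slope * t for t = 0, 1, ..., n-1."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

# Two hypothetical component cells observed at the same time points.
series_a = [1.0, 2.0, 2.5, 4.0, 5.5]
series_b = [3.0, 2.5, 2.0, 2.5, 1.0]
aggregated = [a + b for a, b in zip(series_a, series_b)]

intercept_a, slope_a = fit_line(series_a)
intercept_b, slope_b = fit_line(series_b)
intercept_agg, slope_agg = fit_line(aggregated)

# The aggregated regression is recovered from the stored component regressions.
print(slope_agg, slope_a + slope_b)
print(intercept_agg, intercept_a + intercept_b)
```

This is why storing only the regression parameters of the base cells can be enough to answer regression queries on such aggregated cells without revisiting the raw series.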
Data Mining Results and Applications
Guozhu Dong
26
In-Network Detection of Shapes of Region-Based
Events in Sensor Networks
[Figure: a sensor network in which several sensor
nodes are sensing an event]
Each sensor can sense events, and talk with
neighbors
27
Research Problems Studied

Detection of Region-Based Events: given a sensor
network, when a region-based event occurs, report
the spatial geometric information, which may
include
the boundaries and the shape of the region
positions of important points
important metrics: length, area, density
Tracking of Region-Based Events: after initial
detection of a region-based event, determine its
spatial dynamic parameters (moving direction,
speed, expansion rate of area, etc).
Computation is done in the sensor network, which
is organized into an R-tree.

28
Multiple platforms/labs dataset
concordance/consistency evaluation

Microarrays (supplied by different manufacturers)
are used to measure gene expressions in tissues,
by different labs.
Without knowing the concordance between
platform/lab conditions, it is hard to transfer
knowledge (patterns/classifiers) from one lab to
another
We provide measures and techniques to address
this problem, based on discriminating
gene/classifier transferability

29
Summarizing clusterings of documents

We often need to process large collections of
documents (abstracts, articles, Google search
results, ...)
We need methods to help us quickly get a sense of
the main themes of the documents
We gave methods to find summary word sets
(cluster description sets) to describe
clusterings of documents
Words in a summary set for a cluster should be
typical in the cluster, and be rare in other
clusters
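
One simple way to score words against this criterion is sketched below: a word's score for a cluster is its document frequency inside the cluster minus its highest document frequency in any other cluster. This scoring rule, the function names and the toy clusters are assumptions made for illustration, not the method from the lab's papers.

```python
def doc_frequency(word, docs):
    """Fraction of documents (sets of words) that contain word."""
    return sum(word in doc for doc in docs) / len(docs)

def summary_words(clusters, cluster_id, k=2):
    """Top-k words that are typical in the cluster and rare in the others."""
    own = clusters[cluster_id]
    others = [docs for cid, docs in clusters.items() if cid != cluster_id]
    vocabulary = set().union(*own)

    def score(word):
        rare = max((doc_frequency(word, docs) for docs in others), default=0.0)
        return doc_frequency(word, own) - rare

    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(vocabulary, key=lambda w: (-score(w), w))[:k]

# Hypothetical clustering of six tiny documents.
clusters = {
    "mining": [{"pattern", "mining", "data"},
               {"contrast", "pattern", "mining", "data"},
               {"pattern", "classifier", "data"}],
    "sensors": [{"sensor", "network", "data"},
                {"sensor", "event", "data"},
                {"network", "event", "sensor"}],
}
print(summary_words(clusters, "mining"))
print(summary_words(clusters, "sensors", k=1))
```

A word such as "data" is frequent in both clusters, so it scores low for each of them even though it is typical everywhere.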

30
Alternative Clustering

Clustering is usually performed on poorly
understood datasets
Multiple clusterings (ways to group the data) may
exist
Need methods to discover alternative clusterings
We gave algorithms to solve this problem, and
introduced a new similarity measure between
clusterings

31
Undesirable object converter mining

We have a class of desirable objects and a class
of undesirable objects.
The goal is to mine small sets of attribute
changes, which when applied to undesirable
objects, may change those objects' class from
undesirable to desirable.
We considered two types of converter sets:
personalized, and universal
We gave algorithms to mine them

32
Data mining for knowledge transfer

We have two application domains: a well
understood one and a less understood one.
The goal is to mine knowledge that can be
transferred from the well understood domain to
the less understood domain, to solve problems in
the less understood domain

33
Comparative summary of search results

We often perform multiple searches on the web or
on a document collection.
There is an information overload, when we process
the search results.
We developed tools to compare and summarize the
search results to reduce the information
overload.
Compare two searches -- examples:
Same key words searched at two time points
Same key words searched over two locations, etc.

34
Outline of Some Recent Works, Review