Selected Research Results

Selected Research Results & Applications of WSU's Data Mining Research Lab. Guozhu Dong, PhD, Professor, Data Mining Research Lab, Wright State University


1
Selected Research Results & Applications of WSU's
Data Mining Research Lab
  • Guozhu Dong
  • PhD, Professor
  • Data Mining Research Lab
  • Wright State University

2
Outline
  • Contrast data mining
  • Contrast pattern based classifiers
  • Contrast pattern mining on sequence data
  • Real-time mining/analysis of sensor network data
  • Multi-dimensional multi-level data mining in data
    cubes
  • Mining large collections of time series
  • Microarray concordance analysis
  • Summarizing clusterings of abstracts/articles
  • Alternative clustering
  • Conversion of undesirable objects
  • Data mining for knowledge transfer
  • Comparative summary of search results

Focus on the bold topics
3
Contrast data mining - What & Why?
  • Contrast - To compare or appraise in respect to
    differences (Merriam Webster Dictionary)
  • Contrast data mining - The mining of patterns and
    models contrasting two or more classes,
    conditions, or datasets.
  • Why?
    • "Sometimes it's good to contrast what you like
      with something else. It makes you appreciate it
      even more." (Darby Conley, Get Fuzzy, 2001)
    • Useful for understanding,
      prediction/classification, outlier detection, ...

4
What can be contrasted?
  • Objects at different time periods
    • Compare ICDM papers published in 2006-2007
      versus those in 2004-2005 to find emerging
      research directions
  • Objects for different spatial locations
    • Find the distinguishing patterns of cars sold
      in the south, versus those sold in the north
  • Objects across different classes
    • Find the key differences between normal colon
      tissues and cancerous colon tissues

5
How do we contrast two datasets, without advanced
mining tools?
  • Let D1 and D2 be the two datasets.
  • We usually find a prototypical case p1 for D1,
    and a prototypical case p2 for D2. Then we
    compare p1 against p2.
  • We may also compare the distribution of D1
    against that of D2.
  • Such simplifications often miss the interesting
    contrast patterns.

6
Alternative names for contrast data
mining/patterns
  • Contrast data mining is related to change mining,
    difference mining, discriminator mining,
    classification rule mining, ...
  • Contrast patterns are related to these patterns:
    change patterns, class based association
    rules, contrast sets, concept drift, difference
    patterns, discriminative patterns,
    (dis)similarity patterns, emerging patterns,
    gradient patterns, high confidence patterns,
    (in)frequent patterns, ...

7
How is contrast data mining used?
  • Domain understanding
    • Young children with diabetes have a greater
      risk of hospital admission, compared to the rest
      of the population
  • Used for building classifiers
    • Many different techniques - to be covered later
    • Also used for weighting and ranking instances
  • Used for monitoring
    • Tell me when something unusual (unlike others
      in this class) arrives
  • Understanding can help us do prevention,
    prediction can help us do treatment. An ounce of
    prevention is worth a pound of cure!

8
Emerging Patterns
(Support = frequency)
  • Emerging Patterns (EPs) are contrast patterns
    between two classes of data whose support changes
    significantly between the two classes.
    Significant change can be defined by:
    • big support ratio:
      supp2(X)/supp1(X) > minRatio
      (similar to RiskRatio; allows patterns with
      small overall support)
    • big support difference:
      |supp2(X) - supp1(X)| > minDiff
      (as defined by Bay & Pazzani 99)
  • If supp2(X)/supp1(X) = infinity, then X is a
    jumping EP.
    • A jumping EP occurs in some members of one class
      but never occurs in the other class.
  • Here, X is the AND of a set of simple conditions.
    Extension to OR was also studied.
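
The definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the two toy datasets D1 and D2 are made up here and are not from the talk. Each transaction is a set of items, and a pattern X is a set of items that must all be present (the AND of simple conditions).

```python
import math

def support(pattern, dataset):
    """Fraction of transactions that contain every item of `pattern`."""
    return sum(1 for t in dataset if pattern <= t) / len(dataset)

def support_ratio(pattern, d1, d2):
    """supp2(X)/supp1(X); infinity if X occurs in d2 but never in d1."""
    s1, s2 = support(pattern, d1), support(pattern, d2)
    if s1 == 0:
        return math.inf if s2 > 0 else 0.0
    return s2 / s1

def support_diff(pattern, d1, d2):
    """|supp2(X) - supp1(X)|."""
    return abs(support(pattern, d2) - support(pattern, d1))

def is_jumping_ep(pattern, d1, d2):
    """True if X occurs in some members of d2 but never in d1."""
    return support_ratio(pattern, d1, d2) == math.inf

# Hypothetical toy data (not from the slides).
D1 = [{"a", "b"}, {"b"}, {"a", "c"}, {"c"}]
D2 = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a"}]

print(support_ratio({"a", "b"}, D1, D2))   # ratio of supports in D2 vs D1
print(support_diff({"a", "b"}, D1, D2))
print(is_jumping_ep({"b", "c"}, D1, D2))
```

A pattern would then be reported as an EP when its ratio (or difference) exceeds a user-chosen minRatio (or minDiff).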
9
Example EP in microarray data for cancer
  • EP example: X = {g1=L, g2=H, g3=L};
    suppN(X) = 50%, suppC(X) = 0
  • Use minimality to reduce number of mined EPs

Binned data (rows are tissues, columns are genes)

Normal Tissues
g1 g2 g3 g4
L  H  L  H
L  H  L  L
H  L  L  H
L  H  H  L

Cancer Tissues
g1 g2 g3 g4
H  H  L  H
L  H  H  H
L  L  L  H
H  H  H  L
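
As a check on the example above, the following sketch recomputes the supports of X on the toy binned table and enumerates minimal jumping EPs by brute force. It assumes the reading of minimality given on the next slide (each proper subset occurs in both classes). The exhaustive level-wise search is only workable for a handful of genes; it is not one of the mining algorithms cited in the talk, and the function names are illustrative.

```python
from itertools import combinations

GENES = ["g1", "g2", "g3", "g4"]
NORMAL_ROWS = ["LHLH", "LHLL", "HLLH", "LHHL"]
CANCER_ROWS = ["HHLH", "LHHH", "LLLH", "HHHL"]

def to_items(row):
    """Turn a row such as 'LHLH' into items like 'g1=L', 'g2=H', ..."""
    return frozenset(f"{g}={level}" for g, level in zip(GENES, row))

def support(pattern, tissues):
    """Fraction of tissues that contain every item of `pattern`."""
    return sum(1 for t in tissues if pattern <= t) / len(tissues)

def minimal_jumping_eps(home, other):
    """Patterns seen in `home`, never in `other`, with no smaller such pattern inside."""
    home_items = sorted(set().union(*home))
    found = []
    for size in range(1, len(GENES) + 1):          # small patterns first
        for combo in combinations(home_items, size):
            x = frozenset(combo)
            if any(m <= x for m in found):         # contains a smaller jumping EP
                continue
            if support(x, home) > 0 and support(x, other) == 0:
                found.append(x)
    return found

normal = [to_items(r) for r in NORMAL_ROWS]
cancer = [to_items(r) for r in CANCER_ROWS]
X = frozenset({"g1=L", "g2=H", "g3=L"})

print(support(X, normal), support(X, cancer))
for ep in minimal_jumping_eps(normal, cancer):
    print(sorted(ep))
```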
10
Top support minimal jumping EPs for colon cancer
These EPs have 95%-100% support in one class but
0% support in the other class. Minimal: each
proper subset occurs in both classes.
  • Colon Cancer EPs (items, then support in %)
    • 1 4- 112 113: 100
    • 1 4- 113 116: 100
    • 1 4- 113 221: 100
    • 1 4- 113 696: 100
    • 1 108- 112 113: 100
    • 1 108- 113 116: 100
    • 4- 108- 112 113: 100
    • 4- 109 113 700: 100
    • 4- 110 112 113: 100
    • 4- 112 113 700: 100
    • 4- 113 117 700: 100
    • 1 6 8- 700: 97.5
  • Colon Normal EPs (items, then support in %)
    • 12- 21- 35 40 137 254: 100
    • 12- 35 40 71- 137 254: 100
    • 20- 21- 35 137 254: 100
    • 20- 35 71- 137 254: 100
    • 5- 35 137 177: 95.5
    • 5- 35 137 254: 95.5
    • 5- 35 137 419-: 95.5
    • 5- 137 177 309: 95.5
    • 5- 137 254 309: 95.5
    • 7- 21- 33 35 69: 95.5
    • 7- 21- 33 69 309: 95.5
    • 7- 21- 33 69 1261: 95.5

EPs from Mao & Dong 05 (gene club border-diff).
There are 1000 items with supp > 80%.
Colon cancer dataset (Alon et al, 1999 (PNAS)):
40 cancer tissues, 22 normal tissues, 2000 genes.
Very few 100% support EPs.
11
Besides uses discussed earlier, another potential
use of minimal jumping EPs
  • Minimal jumping EPs for normal tissues
    → Properly expressed gene groups important for
      normal cell functioning, but destroyed in all
      colon cancer tissues
    → Restore these → cure colon cancer?
  • Minimal jumping EPs for cancer tissues
    → Bad gene expression groups that occur in
      some cancer tissues but never occur in normal
      tissues
    → Disrupt these → cure colon cancer?
  • → Possible targets for drug design?

Li & Wong 02 proposed gene therapy using the EP idea.
A paper using EPs was published in Cancer Cell
(cover, 3/02). EPs have been applied in medical
applications for diagnosing acute lymphoblastic
leukemia, etc.
12
EP Mining Algorithms and Studies
  • Complexity result (Wang et al 05)
  • Border-differential algorithm (Dong & Li 99)
  • Gene club border differential (Mao & Dong 05)
  • Constraint-based approach (Zhang et al 00)
  • Tree-based approach (Bailey et al 02,
    Fan & Kotagiri 02)
  • Projection based algorithm (Bailey et al 03)
  • ZBDD based method (Loekito & Bailey 06)
  • Equivalence class based (Li et al 07).

Can handle 200 dimensions
13
Contrast pattern based classification -- history
  • Contrast pattern based classification: methods to
    build or improve classifiers, using contrast
    patterns
  • CBA (Liu et al 98)
  • CAEP (Dong et al 99)
  • Instance based method: DeEPs (Li et al 00, 04)
  • Jumping EP based (Li et al 00), information based
    (Zhang et al 00), Bayesian based (Fan & Kotagiri
    03), improving scoring for >=3 classes (Bailey et
    al 03)
  • CMAR (Li et al 01)
  • Top-ranked EP based: PCL (Li & Wong 02)
  • CPAR (Yin & Han 03)
  • Weighted decision tree (Alhammady & Kotagiri 06)
  • Rare class classification (Alhammady & Kotagiri 04)
  • Constructing supplementary training instances
    (Alhammady & Kotagiri 05)
  • Noise tolerant classification (Fan & Kotagiri 04)
  • One-class classification/detection of outlier
    cases (Chen & Dong 06)
  • Most follow the aggregating approach of CAEP.

14
EP-based classifiers: rationale
  • Consider a typical EP in the Mushroom dataset,
    {odor = none, stalk-surface-below-ring = smooth,
    ring-number = one}: its support increases from
    0.2% in poisonous to 57.6% in edible
    (support ratio = 288).
  • Strong differentiating power: if a test case T
    contains this EP, we can predict T as edible with
    high confidence, 99.6% = 57.6/(57.6+0.2)
  • A single EP is usually sharp in telling the class
    of a small fraction (e.g. 3%) of all instances.
    Need to aggregate the power of many EPs to make
    the classification.
  • EP based classification methods often outperform
    state of the art classifiers, including C4.5 and
    SVM. They are also noise tolerant.

15
CAEP (Classification by Aggregating Emerging
Patterns)
  • Given a test case T, obtain T's scores for each
    class, by aggregating the discriminating power of
    EPs contained in T; assign the class with the
    maximal score as T's class.
  • The discriminating power of EPs is expressed in
    terms of supports and growth rates. Prefer large
    supRatio, large support.
  • The contribution of one EP X (support weighted
    confidence):

strength(X) = sup(X) * supRatio(X) /
(supRatio(X) + 1)

(CMAR aggregates "Chi2 weighted Chi2")
  • Given a test T and a set E(Ci) of EPs for class
    Ci, the aggregate score of T for Ci is

score(T, Ci) = Σ strength(X) (over X of Ci
matching T)
  • For each class, may use median (or 85%)
    aggregated value to normalize, to avoid bias
    towards class with more EPs

16
How CAEP works? An example
  • Given a test case T = {a,d,e}, how to classify T?

Class 1 (D1)
a c d e
a e
b c d e
b

Class 2 (D2)
a b
a b c d
c e
a b d e
  • T contains EPs of class 1: {a,e} (50% : 25%) and
    {d,e} (50% : 25%), so Score(T, class 1) =

0.5 * 0.5/(0.5+0.25) + 0.5 * 0.5/(0.5+0.25)
= 0.67
  • T contains EPs of class 2: {a,d} (25% : 50%), so
    Score(T, class 2) = 0.33
  • T will be classified as class 1 since
    Score1 > Score2
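
The worked example can be reproduced with a short sketch. Two things here are assumptions, not from the slides: EPs are found by brute-force enumeration of the subsets of T, and a pattern counts as an EP when its support ratio is at least 2 (a threshold chosen because it yields exactly the three EPs listed above). The strength is written as sup * sup / (sup + sup_other), which is algebraically the same as sup(X) * supRatio(X) / (supRatio(X) + 1) and avoids dividing by zero for jumping EPs. No normalization step is applied.

```python
from itertools import combinations

D1 = [{"a", "c", "d", "e"}, {"a", "e"}, {"b", "c", "d", "e"}, {"b"}]
D2 = [{"a", "b"}, {"a", "b", "c", "d"}, {"c", "e"}, {"a", "b", "d", "e"}]

def support(itemset, dataset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t) / len(dataset)

def caep_score(test, home, other, min_ratio=2.0):
    """Sum of strengths of the EPs of class `home` that are contained in `test`."""
    score = 0.0
    items = sorted(test)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            x = frozenset(combo)
            s_home, s_other = support(x, home), support(x, other)
            if s_home == 0:
                continue
            # min_ratio is an assumed threshold, not given on the slide.
            if s_other == 0 or s_home / s_other >= min_ratio:
                score += s_home * s_home / (s_home + s_other)
    return score

T = {"a", "d", "e"}
score1 = caep_score(T, D1, D2)
score2 = caep_score(T, D2, D1)
print(round(score1, 2), round(score2, 2))
print("class 1" if score1 > score2 else "class 2")
```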
17
DeEPs (Decision-making by Emerging Patterns)
  • An instance based (lazy) learning method, like
    k-NN but does not use the normal distance
    measure.
  • For a test instance T, DeEPs
    • First projects all training instances to contain
      only items in T
    • Discovers EPs from the projected data
    • Uses these EPs to get the training data that
      match some discovered EPs
    • Finally, uses the proportional size of matching
      data in a class C as T's score for C
  • Advantage: disallows similar EPs from giving
    duplicate votes!
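
The four steps can be sketched as follows. This is a simplified illustration of the idea as described on this slide, not the published DeEPs algorithm: EPs are found by exhaustive enumeration, and the support-ratio threshold of 2 is an assumption. The toy data is the Class 1 / Class 2 example from the CAEP slide.

```python
from itertools import combinations

def support(itemset, dataset):
    """Fraction of instances that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t) / len(dataset)

def deeps_scores(test, training, min_ratio=2.0):
    """Return {class label: fraction of its instances matching some EP}."""
    # Step 1: project every training instance onto the items of the test case.
    projected = {label: [inst & test for inst in insts]
                 for label, insts in training.items()}
    items = sorted(test)
    candidates = [frozenset(c) for size in range(1, len(items) + 1)
                  for c in combinations(items, size)]
    scores = {}
    for label, home in projected.items():
        others = [inst for other, insts in projected.items()
                  if other != label for inst in insts]
        # Step 2: discover EPs of this class from the projected data.
        eps = []
        for x in candidates:
            s_home = support(x, home)
            s_other = support(x, others) if others else 0.0
            if s_home > 0 and (s_other == 0 or s_home / s_other >= min_ratio):
                eps.append(x)
        # Steps 3-4: count each matching instance once, however many EPs it matches.
        matched = sum(1 for inst in home if any(ep <= inst for ep in eps))
        scores[label] = matched / len(home)
    return scores

TRAINING = {
    "class1": [{"a", "c", "d", "e"}, {"a", "e"}, {"b", "c", "d", "e"}, {"b"}],
    "class2": [{"a", "b"}, {"a", "b", "c", "d"}, {"c", "e"}, {"a", "b", "d", "e"}],
}
scores = deeps_scores({"a", "d", "e"}, TRAINING)
print(scores)
```

Because an instance is counted once even when it contains several EPs, overlapping EPs do not inflate a class's score.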

18
Why EP-based classifiers are good
  • Use the discriminating power of low support EPs
    (with high supRatio), in addition to the high
    support ones
  • Use multi-feature conditions, not just
    single-feature conditions
  • Select from larger pools of discriminative
    conditions
  • Compare: the search space of patterns for decision
    trees is limited by early greedy choices.
  • Aggregate/combine the discriminating power of a
    diversified committee of experts (EPs)
  • Decisions of such classifiers are highly
    explainable

19
Also Studied Contrast Pattern Mining for
  • Sequence family A vs sequence family B
  • Graph collection A vs graph collection B
  • Build contrast pattern based clustering quality
    index
  • Constructing synthetic training data for classes
    with few training instances
  • More than 6 PhD dissertations
  • About 50 research papers
  • A tutorial given at IEEE ICDM 2007

20
Multi-dimensional multi-level data mining in data
cubes
  • Data cube is used for discovering patterns
    captured in consolidated historical data for a
    company/organization
  • rules, anomalies, unusual factor combinations
  • Data cube is focused on modeling & analysis of
    data for decision makers, not daily operations.
  • Data organized around major subjects or factors,
    such as customer, product, time, sales.
  • Cube contains huge number of multi-dimensional
    multi-level (MDML) summaries for segments or
    sectors at different levels of details
  • Basic OLAP operations: drill down, roll up, slice
    and dice, pivot

21
Data Cubes: Base Table & Hierarchies
  • Base table stores sales volume (measure), a
    function of product, time, location (dimensions)

Hierarchical summarization paths (coarse to fine),
with "all" as the top of each dimension
  • Product: Industry, Category, Product
  • Location: Region, Country, City, Office
  • Time: Year, Quarter, Month, Week, Day
[Figure: the three hierarchies and a base cell]
22
Data Cubes: Derived Cells
Measures: sum, count, avg, max, min, std, ...
Example derived cell: (TV, *, Mexico)
Derived cells, different levels of details
23
Gradient mining in data cubes
  • Find syntactically similar cells with
    significantly different measure values
  • E.g.
    • (house, California, May, 2008), total-sale = 100M
    • vs (house, Iowa, May, 2008), total-sale = 200M
    • These numbers are made up to show the point

Other people studied iceberg cubes, cells
significantly different from neighbors, ...
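
A minimal sketch of the gradient idea, under two assumptions that are not from the slides: "syntactically similar" is taken to mean that two cells differ in exactly one dimension, and "significantly different" is taken to mean that the ratio of their measures reaches a user-chosen threshold. The third cell in the toy data is invented for contrast.

```python
from itertools import combinations

def gradient_pairs(cells, min_ratio):
    """Return (low_cell, high_cell, ratio) for cells differing in one dimension."""
    pairs = []
    for (c1, m1), (c2, m2) in combinations(cells.items(), 2):
        differing = sum(1 for a, b in zip(c1, c2) if a != b)
        if differing != 1:
            continue                                  # not sibling-like cells
        low, high = sorted([(m1, c1), (m2, c2)])
        if low[0] > 0 and high[0] / low[0] >= min_ratio:
            pairs.append((low[1], high[1], high[0] / low[0]))
    return pairs

# Measures are total sales in millions; the first two cells are the slide's
# made-up example, the third is hypothetical.
CELLS = {
    ("house", "California", "May", 2008): 100,
    ("house", "Iowa", "May", 2008): 200,
    ("car", "Iowa", "May", 2008): 190,
}

for low_cell, high_cell, ratio in gradient_pairs(CELLS, min_ratio=1.5):
    print(low_cell, "->", high_cell, "ratio", ratio)
```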
24
Multi-Dimensional Trends Analysis of Sets of
Time-Series in Data Cubes
  • Consider applications having many time series
    • ECG curves, stocks, power grids, sensor
      networks, internet, gene expressions for
      toxicology study, ...
  • Need MDML trends analysis
    • Mining/monitoring unusual patterns/events, in
      MDML manner
    • E.g. find good sets of stocks with desired total
      risk/reward ratios
  • Regression cube for time series
    • Store regression base cube
    • Support MDML OLAP of regressions
    • Results also useful for MDML data stream
      monitoring

25
Example: Aggregating Set of Time Series
  • Two component cells
  • Aggregated cell

Deriving regression of aggregated cell from
regression of component cells
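
One way to see why this derivation is possible: if the aggregated cell is the pointwise sum of its component series over the same time points (an assumption made for this sketch), then the least-squares line of the sum has the sum of the component slopes and the sum of the component intercepts, because a least-squares fit is linear in the observed values. The series below are made up.

```python
def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept). Needs at least two points.
    """
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean

# Two hypothetical component cells observed at the same four time points.
series_1 = [1.0, 3.0, 2.0, 5.0]
series_2 = [2.0, 2.0, 4.0, 3.0]
aggregated = [x + y for x, y in zip(series_1, series_2)]

slope_1, icpt_1 = fit_line(series_1)
slope_2, icpt_2 = fit_line(series_2)
slope_direct, icpt_direct = fit_line(aggregated)

# Derived from the stored component regressions only, without the raw data.
slope_derived, icpt_derived = slope_1 + slope_2, icpt_1 + icpt_2
print(slope_direct, slope_derived)
print(icpt_direct, icpt_derived)
```

Storing only the regression parameters per base cell is therefore enough to answer this kind of roll-up query.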
26
In-Network Detection of Shapes of Region-Based
Events in Sensor Networks
[Figure: sensor nodes in a network, with nodes
labeled "Event Sensing"]
Each sensor can sense events, and talk with
neighbors
27
Research Problems Studied
  • Detection of Region-Based Events: given a sensor
    network, when a region-based event occurs, report
    the spatial geometric information, which may
    include
    • the boundaries and the shape of the region
    • positions of important points
    • important metrics: length, area, density
  • Tracking of Region-Based Events: after initial
    detection of a region-based event, determine its
    spatial dynamic parameters (moving direction,
    speed, expansion rate of area, etc).
  • Computation is done in the sensor network, which
    is organized into an R-tree.

28
Multiple platforms/labs dataset
concordance/consistency evaluation
  • Microarrays (supplied by different manufacturers)
    are used to measure gene expressions in tissues,
    by different labs.
  • Without knowing the concordance between
    platform/lab conditions, it is hard to transfer
    knowledge (patterns/classifiers) from one lab to
    another
  • We provide measures and techniques to address
    this problem, based on discriminating
    gene/classifier transferability

29
Summarizing clusterings of documents
  • We often need to process large collections of
    documents (abstracts, articles, Google search
    results, ...)
  • We need methods to help us quickly get a sense of
    the main themes of the documents
  • We gave methods to find summary word sets
    (cluster description sets) to describe
    clusterings of documents
  • Words in a summary set for a cluster should be
    typical in the cluster, and be rare in other
    clusters
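
The last criterion (typical inside the cluster, rare elsewhere) can be illustrated with a simple score. This is a hypothetical scoring rule written for illustration, not the method from the talk: each word is scored by its document frequency inside the cluster minus its highest document frequency in any other cluster. The tiny clusters are invented.

```python
def summary_words(clusters, target, k=3):
    """Pick k words that are frequent in `target` and rare in other clusters."""
    def doc_freq(word, docs):
        return sum(1 for d in docs if word in d) / len(docs)

    vocab = set().union(*clusters[target])
    scored = []
    for word in vocab:
        inside = doc_freq(word, clusters[target])
        outside = max((doc_freq(word, docs)
                       for name, docs in clusters.items() if name != target),
                      default=0.0)
        scored.append((inside - outside, word))
    # Highest score first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:k]]

# Hypothetical clusters; each document is a set of words.
CLUSTERS = {
    "mining": [{"pattern", "data", "mining"}, {"pattern", "contrast", "data"}],
    "sensors": [{"sensor", "network", "data"}, {"sensor", "event", "data"}],
}

print(summary_words(CLUSTERS, "mining", k=2))
print(summary_words(CLUSTERS, "sensors", k=1))
```

Note that "data" scores zero for both clusters because it is equally common in each, which is the behaviour the slide asks for.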

30
Alternative Clustering
  • Clustering is usually performed on poorly
    understood datasets
  • Multiple clusterings (ways to group the data) may
    exist
  • Need methods to discover alternative clusterings
  • We gave algorithms to solve this problem, and
    introduced a new similarity measure between
    clusterings

31
Undesirable object converter mining
  • We have a class of desirable objects and a class
    of undesirable objects.
  • The goal is to mine small sets of attribute
    changes, which when applied to undesirable
    objects, may change those objects' class from
    undesirable to desirable.
  • We considered two types of converter sets:
    personalized, and universal
  • We gave algorithms to mine them

32
Data mining for knowledge transfer
  • We have two application domains: a well
    understood one and a less understood one.
  • The goal is to mine knowledge that can be
    transferred from the well understood domain to
    the less understood domain, to solve problems in
    the less understood domain

33
Comparative summary of search results
  • We often perform multiple searches on the web or
    on a document collection.
  • There is an information overload, when we process
    the search results.
  • We developed tools to compare and summarize the
    search results to reduce the information
    overload.
  • Compare two searches -- examples
  • Same key words searched at two time points
  • Same key words searched over two locations etc

34
Outline of Some Recent Works, Review
  • Contrast data mining
  • Contrast pattern based classifiers
  • Contrast pattern mining on sequence data
  • Real-time mining/analysis of sensor network data
  • Multi-dimensional multi-level data mining in data
    cubes
  • Mining large collections of time series
  • Microarray concordance analysis using contrast
    patterns
  • Summarizing clusterings of abstracts/articles
  • Alternative clustering
  • Conversion of undesirable objects
  • Data mining for knowledge transfer
  • Comparative summary of search results

35
Thank you
  • List of papers available at
    http://www.cs.wright.edu/gdong/
  • Email: guozhu.dong_at_wright.edu
  • Collaboration opportunities to work on your
    problems are welcome