Probation Talk: On Mining Microarray Data by Order-Preserving Submatrix


1
Probation Talk: On Mining Micro-array Data by
Order-Preserving Submatrix

Lin Cheung
Kevin Y. Yip
David W. Cheung
Ben Kao
Michael K. Ng

2
Abstract

Study the problem of pattern-based subspace
clustering.
Unlike traditional clustering methods, which focus
on grouping objects with similar values on a set
of dimensions, clustering by pattern similarity
finds objects that exhibit a coherent pattern of
rises and falls in subspaces.
Applications:
DNA micro-array data analysis
Automatic recommendation systems
Target marketing systems

3
Outline

Introduction
Related Work
Problem Definition
Algorithm
Experiment Results
Future Plan

4
Introduction

Invention of DNA micro-array technologies has
revolutionized the experimental study of gene
expression.
Thousands of genes are probed every day.
Various gene expression data analysis techniques
have been intensively studied.

5
Introduction

In DNA micro-array data analysis, gene expression
data is organized as matrices.
a row carries the information of a gene
a column represents a sample for an experiment.
The number in each cell records the expression
value of a particular gene under a particular
sample.
In this presentation, we will use the terms
object to mean a row (gene) of a dataset
dimension (or attribute) to mean a column
(sample) of a dataset

[Figure: expression data matrix, with genes as rows and samples as columns]

6
Clustering

Clustering
One of the most popular methods of discovering
useful biological insights from gene expression
data.
Groups together data points that are close or
similar to each other over all dimensions into
clusters
Traditional distance-based clustering
Depends heavily on the distance (or similarity)
measure.
Euclidean distance, Manhattan distance, etc.
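
The two distance measures named on this slide can be sketched as follows. This is a minimal illustration only: the function names and the expression values of the two genes are hypothetical and do not come from the talk.

```python
import math


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length rows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length rows."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical expression values of two genes over four samples.
gene_a = [1.0, 4.0, 2.0, 8.0]
gene_b = [2.0, 6.0, 2.0, 5.0]

print(euclidean(gene_a, gene_b))
print(manhattan(gene_a, gene_b))
```

Both measures compare the rows over all dimensions at once, which is the behaviour the slide attributes to traditional distance-based clustering.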

7
Subspace clustering

DNA micro-array data is typically
high-dimensional
Clustering over the global set of attributes
often fails to extract all meaningful clusters
Objects tend to exhibit strong similarity only
over an (unknown) subset of the dimensions.
Subspace clustering
Group together data points that are close or
similar to each other over a subset of the
dimensions into clusters
Discovery of clusters that are embedded in
certain subspaces of high-dimensional data, such
as gene expression data

8
Pattern-based subspace clustering

A special type of subspace clustering that uses
pattern similarity as a measure of object
distances.

9
Pattern-based subspace clustering
[Figure: raw data, 3 rows and 10 columns]
10
Pattern-based subspace clustering

In columns b, c, f, h, i, the expression values
of the rows follow the same rise-and-fall pattern.
The 3 rows form a cluster in the subspace
{b, c, f, h, i}.
Traditional distance functions are sometimes
inadequate in capturing correlations among rows.

11
Pattern-based subspace clustering

Rearrange the columns so that the expression
values are listed in ascending order
This forms the column sequence <f, c, b, i, h>
All rows are increasing under the new column
sequence
<f, c, b, i, h> is referred to as an
order-preserving pattern
The length of a pattern is its number of columns
A row is a supporting row if its values are in
increasing order with respect to the column
sequence; the support of a pattern is its number
of supporting rows
An order-preserving pattern together with its set
of supporting rows forms an Order-Preserving
Sub-matrix, or OPSM.
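
The notions of supporting row, support, and OPSM can be sketched in a few lines. This is a minimal illustration, assuming "increasing" means strictly increasing. The row ids, all expression values, and the fourth non-supporting row are hypothetical; only the column names and the pattern come from the slides.

```python
def supporting_rows(data, pattern):
    """Return the ids of rows whose values strictly increase along `pattern`.

    data: dict mapping row id -> dict mapping column name -> value.
    pattern: list of column names (a candidate order-preserving pattern).
    """
    rows = []
    for row_id, values in data.items():
        ordered = [values[c] for c in pattern]
        if all(x < y for x, y in zip(ordered, ordered[1:])):
            rows.append(row_id)
    return rows


# Hypothetical values, restricted to the columns b, c, f, h, i.
data = {
    "r1": {"b": 30, "c": 20, "f": 10, "h": 50, "i": 40},
    "r2": {"b": 9, "c": 5, "f": 1, "h": 15, "i": 12},
    "r3": {"b": 300, "c": 150, "f": 120, "h": 900, "i": 400},
    "r4": {"b": 1, "c": 2, "f": 3, "h": 4, "i": 5},  # follows another pattern
}
pattern = ["f", "c", "b", "i", "h"]

rows = supporting_rows(data, pattern)
opsm = (rows, pattern)  # the pattern together with its supporting rows
print("supporting rows:", rows)
print("support:", len(rows), "pattern length:", len(pattern))
```

Rows r1 to r3 differ widely in magnitude, yet all of them support the same pattern, which is the kind of similarity this model is meant to capture.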

12
Challenges

Computationally challenging problem
Complexity lies in the requirement of
simultaneously determining both cluster members
and relevant dimensions.
Number of potential order-preserving patterns
grows exponentially with respect to the number of
attributes.
A dataset D with n attributes has $\sum_{i} \frac{n!}{(n-i)!}$
potential order-preserving patterns
There could be tens to hundreds of attributes in
DNA micro-array datasets
The number of potential OPSMs is huge
If a set of rows R and a set of columns C form a
valid OPSM, then any subset of those rows plus any
subset of those columns forms a valid OPSM too
FOCUS
Mine ALL maximal OPSMs only
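
To give a feel for the growth in the number of potential patterns, the count can be evaluated directly. This is a sketch under an assumption: the slide does not show the range of the summation index, so the hypothetical parameter `min_len` (default 1) sets the shortest pattern length counted, and the sum runs up to n.

```python
from math import factorial


def count_patterns(n, min_len=1):
    """Sum of n!/(n-i)! for i = min_len..n.

    Each term counts the ordered selections of i columns out of n.
    The summation range is an assumption, not taken from the slides.
    """
    return sum(factorial(n) // factorial(n - i) for i in range(min_len, n + 1))


for n in (3, 5, 10, 17):
    print(n, count_patterns(n))
```

Under this assumption, 10 attributes already give close to ten million potential patterns, and the count keeps growing factorially for the tens to hundreds of attributes mentioned above.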

13
Related Work

OPSM model, Ben-Dor et al., RECOMB 2002 [3]
Proposed a partial model that returns k
good-quality OPSMs

14
Related Work

δ-pCluster, Wang et al., SIGMOD 2002 [11]
Can only discover shift patterns and
scaling patterns

15
Formal description of the OPSM problem

Consider a gene-expression dataset D, represented
as a matrix.
We use O and C to denote the set of rows and
columns in D, respectively.
We use $d_{i,j}$ to denote the entry of D in row i
and column j.

16
Formal description of the OPSM problem

A cluster S is a submatrix of D formed by a
subset of $n_S$ ($\ge 2$) rows and a subset of
$m_S$ ($\ge 2$) columns of D.
Rows and columns in S need not be contiguous in
D. The rows in S are referenced by their row
indices in D, each of which is a distinct integer
in $\{1, 2, \ldots, n_D\}$. The set of row indices
of S is denoted as $R_S$.
Columns in S are similarly referenced. The set of
columns in S is denoted by $C_S$.

17
Formal description of the OPSM problem

The columns in S are enclosed in curly brackets,
e.g., $C_S = \{c_1, c_2, \ldots, c_{m_S}\}$. A
sequence $C_S$ of the columns in S is enclosed in
angled brackets, e.g.,
$C_S = \langle c_1, c_2, \ldots, c_{m_S} \rangle$.
The columns in a sequence are totally ordered.
For the basic OPSM problem, a cluster is a set of
rows and a set of columns such that entries in
every row are increasing w.r.t. a particular
column sequence.
A cluster S is thus written as $S = (R_S, C_S)$.

18
Problem Definition

Definition 1: A cluster S is an OPSM if there
exists a sequence of columns such that in the
cluster S, $s_{i,j} < s_{i,j+1}$ for all
$i \in \{1, 2, \ldots, n_S\}$ and all
$j \in \{1, 2, \ldots, m_S - 1\}$, where $s_{i,j}$
is the entry of S in row i and the j-th column of
the sequence.
A cluster S is a subcluster of a cluster S' if
$R_S \subseteq R_{S'}$ and $C_S \subseteq C_{S'}$.
A cluster S is a proper subcluster of a cluster S'
if S is a subcluster of S' and either
$R_S \subset R_{S'}$ or $C_S \subset C_{S'}$.
Definition 2: An OPSM is a maximal OPSM if it is
not a proper subcluster of any OPSM.
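
The two definitions can be sketched as small predicates. This is a minimal illustration under stated assumptions: a cluster is a hypothetical pair (row indices, column sequence), the inequality is read as strict, subclusters compare rows and columns as plain sets, and maximality is checked only against a given collection of OPSMs rather than all OPSMs of a dataset. The matrix values are made up.

```python
def is_opsm(data, rows, col_seq):
    """Definition 1 for a given column sequence: every row strictly increases."""
    return all(
        data[r][a] < data[r][b]
        for r in rows
        for a, b in zip(col_seq, col_seq[1:])
    )


def is_subcluster(s, t):
    """True if the rows and the columns of s are subsets of those of t."""
    return set(s[0]) <= set(t[0]) and set(s[1]) <= set(t[1])


def is_proper_subcluster(s, t):
    """Subcluster with strictly fewer rows or strictly fewer columns."""
    return is_subcluster(s, t) and (
        set(s[0]) < set(t[0]) or set(s[1]) < set(t[1])
    )


def maximal_only(opsms):
    """Keep the OPSMs that are not proper subclusters of another one in the list."""
    return [s for s in opsms if not any(is_proper_subcluster(s, t) for t in opsms)]


# Hypothetical 3 x 3 matrix; rows and columns are referenced by index.
data = [
    [1, 5, 3],
    [2, 9, 4],
    [7, 1, 8],
]
big = ([0, 1], [0, 2, 1])   # rows 0 and 1 increase along columns <0, 2, 1>
small = ([0, 1], [0, 2])

print(is_opsm(data, *big))
print(is_opsm(data, [0, 1, 2], [0, 2, 1]))
print(maximal_only([small, big]))
```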

19
Problem Definition

(Maximal size-constrained OPSM problem) Given a
data matrix D, a supporting row threshold
$n_{min}$, and a column threshold $m_{min}$, find
all maximal OPSMs S in D such that
$n_S \ge n_{min}$ and $m_S \ge m_{min}$.

20
Algorithm

Devised an Apriori-like algorithm to solve the
problem
Invented a new data structure
Head-Tail Trees
Efficient processing of
identifying all pairs of column sequences with
length k where the first k - 1 indices of the
first cluster's column sequence equal the last k
- 1 indices of the second cluster's column
sequence, and
intersecting two sets of row indices
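
The two operations listed above (pairing length-k sequences on a shared run of k - 1 columns, then intersecting row sets) can be sketched as a level-wise search. This is not the head-tail tree implementation from the talk: plain dictionaries are swapped in for the trees, and the function names, the data layout, and the sample values are all assumptions made for illustration.

```python
from itertools import permutations


def mine_opsms(data, columns, n_min, m_min):
    """Level-wise (Apriori-like) search for all size-constrained OPSMs.

    data: dict mapping row id -> dict mapping column -> value.
    Returns a dict mapping each column sequence (a tuple) to the frozenset
    of all rows supporting it. Plain dicts stand in for head-tail trees.
    """
    if n_min < 1:
        raise ValueError("n_min must be at least 1")
    # Level 2: every ordered pair of columns with enough supporting rows.
    level = {}
    for a, b in permutations(columns, 2):
        rows = frozenset(r for r, v in data.items() if v[a] < v[b])
        if len(rows) >= n_min:
            level[(a, b)] = rows
    found = {}
    k = 2  # length of the sequences in the current level
    while level:
        if k >= m_min:
            found.update(level)
        # Group the sequences by their first k-1 columns.
        by_head = {}
        for seq in level:
            by_head.setdefault(seq[:-1], []).append(seq)
        next_level = {}
        for second, rows2 in level.items():
            # `first` starts with the last k-1 columns of `second`.
            for first in by_head.get(second[1:], []):
                rows = rows2 & level[first]
                if len(rows) >= n_min:
                    next_level[second + first[-1:]] = rows
        level = next_level
        k += 1
    return found


def keep_maximal(found):
    """Drop every OPSM that is a proper subcluster of another one in `found`."""
    return {
        seq: rows
        for seq, rows in found.items()
        if not any(
            rows <= rows2 and set(seq) < set(seq2)
            for seq2, rows2 in found.items()
        )
    }


# Hypothetical dataset with three genes and three samples.
data = {
    "g1": {"a": 1, "b": 3, "c": 2},
    "g2": {"a": 2, "b": 6, "c": 4},
    "g3": {"a": 5, "b": 1, "c": 9},
}
found = mine_opsms(data, ["a", "b", "c"], n_min=2, m_min=2)
maximal = keep_maximal(found)
for seq, rows in maximal.items():
    print(seq, sorted(rows))
```

A dictionary lookup on the shared columns finds the joinable pairs here; the head-tail trees of the talk serve the same purpose, presumably with better efficiency on large inputs.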

21
Algorithm
Given a dataset D, $n_{min} = 2$ and $m_{min} = 2$
22
Algorithm
The head tree for clusters with 2 columns
The tail tree for clusters with 2 columns
23
Algorithm
The head tree for clusters with 3 columns
The tail tree for clusters with 3 columns
24
Algorithm

Property 1 (A priori property): A cluster S is an
OPSM if and only if all proper subclusters of S
are also OPSMs.
Property 2 (Transitivity): If
$S_1 = (R_{S_1}, \langle x_1, x_2, \ldots, x_i, y_1, y_2, \ldots, y_j \rangle)$
and
$S_2 = (R_{S_2}, \langle y_1, y_2, \ldots, y_j, z_1, z_2, \ldots, z_k \rangle)$
are two maximal OPSMs and $R_{S_1} \cap R_{S_2}$
contains $n_{min}$ or more indices, then
$S = (R_{S_1} \cap R_{S_2}, \langle x_1, x_2, \ldots, x_i, y_1, y_2, \ldots, y_j, z_1, z_2, \ldots, z_k \rangle)$
is a maximal OPSM.
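
The construction in Property 2 can be sketched as a merge of two clusters whose column sequences overlap. This is a minimal illustration under stated assumptions: a cluster is a hypothetical pair (set of row ids, tuple of columns), and the function only builds the merged cluster without verifying that it is maximal. The row ids and column names are made up.

```python
def merge_by_transitivity(s1, s2, n_min):
    """Build S from S1 = (R1, <x..., y...>) and S2 = (R2, <y..., z...>).

    Returns (R1 & R2, <x..., y..., z...>), or None when fewer than n_min
    rows are shared or no end of S1's sequence matches a start of S2's.
    """
    (rows1, seq1), (rows2, seq2) = s1, s2
    common = set(rows1) & set(rows2)
    if len(common) < n_min:
        return None
    # Find j such that the last j columns of seq1 equal the first j of seq2.
    for j in range(min(len(seq1), len(seq2)), 0, -1):
        if seq1[-j:] == seq2[:j]:
            return common, seq1 + seq2[j:]
    return None


s1 = ({1, 2, 3}, ("x1", "y1", "y2"))
s2 = ({2, 3, 4}, ("y1", "y2", "z1"))
print(merge_by_transitivity(s1, s2, n_min=2))
```

Every shared row increases along both input sequences, and the two sequences agree on the y columns, so the shared rows also increase along the merged sequence.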

25
Experiments

Yeast Dataset
Used in Y. Cheng and G.M. Church. Biclustering of
expression data. ISMB, 2000.
2884 rows x 17 columns
54 gene groups
1. Pick a gene group and put all the genes of the
gene group in a cluster
2. Find all dimension pairs along which the
expression values of all the genes move in the
same direction
3. Merge the results of step 2 to find OPSMs with
more than 2 dimensions
4. Repeat from step 1

26
Scalability with respect to number of columns
Number of rows = 1000, $n_{min} = 10$, $m_{min} = 2$
27
Scalability with respect to number of rows
Number of columns = 10, $n_{min} = 10$, $m_{min} = 2$
28
Future Work

Uncertainty and experimental errors exist in
micro-array datasets
Mine error-tolerant OPSMs

29
References

[1] R. Agrawal and R. Srikant. Fast algorithms
for mining association rules in large databases.
VLDB, 1994.
[2] Y. Cheng and G. M. Church. Biclustering of
expression data. ISMB, 2000.
[3] A. Ben-Dor et al. Discovering local structure
in gene expression data: the order-preserving
submatrix problem. RECOMB, 2002.
[4] J. Pei et al. MaPle: A fast algorithm for
maximal pattern-based clustering. ICDM, 2003.
[5] J. P. Brody et al. Significance and statistical
errors in the analysis of DNA microarray data.
PNAS, 2002.
[6] Y. Kluger et al. Spectral biclustering of
microarray cancer data: Co-clustering genes and
conditions. Genome Research, 13(4):703-716, 2003.
[7] L. Lazzeroni and A. Owen. Plaid models for
gene expression data. Statistica Sinica,
12:61-86, 2002.
[8] J. Liu and W. Wang. OP-Cluster: Clustering by
tendency in high dimensional space. ICDM, 2003.
[9] J. Liu, J. Yang, and W. Wang. Biclustering in
gene expression data by tendency. CSB, 2004.
[10] H. Wang, F. Chu, W. Fan, P. Yu, and J. Pei.
A fast algorithm for subspace clustering by
pattern similarity. SSDBM, 2004.
[11] H. Wang, W. Wang, J. Yang, and P. S. Yu.
Clustering by pattern similarity in large data
sets. SIGMOD, 2002.
[12] J. Yang, H. Wang, W. Wang, and P. Yu.
Enhanced biclustering on expression data. BIBE,
2003.
