Probation Talk: On Mining Microarray Data by Order-Preserving Submatrix

1
Probation Talk: On Mining Micro-array Data by
Order-Preserving Submatrix
  • Lin Cheung
  • Kevin Y. Yip
  • David W. Cheung
  • Ben Kao
  • Michael K. Ng

2
Abstract
  • Study the problem of pattern-based subspace
    clustering
  • Traditional clustering methods focus on grouping
    objects with similar values on a set of
    dimensions
  • Clustering by pattern similarity instead finds
    objects that exhibit a coherent pattern of rises
    and falls in subspaces.
  • Applications
  • DNA micro-array data analysis
  • Automatic recommendation systems
  • Target marketing systems

3
Outline
  • Introduction
  • Related Work
  • Problem Definition
  • Algorithm
  • Experiment Results
  • Future Plan

4
Introduction
  • The invention of DNA micro-array technologies
    has revolutionized the experimental study of
    gene expression.
  • Thousands of genes are probed every day.
  • Various gene expression data analysis techniques
    have been intensively studied.

5
Introduction
  • In DNA micro-array data analysis, gene expression
    data is organized as matrices.
  • A row carries the information of a gene
  • A column represents a sample for an experiment.
  • The number in each cell records the expression
    value of a particular gene in a particular
    sample.
  • In this presentation, we will use the terms
  • object to mean a row (gene) of a dataset
  • dimension (or attribute) to mean a column
    (sample) of a dataset

[Figure: expression matrix with genes as rows and samples as columns]
6
Clustering
  • Clustering
  • One of the most popular methods of discovering
    useful biological insights from gene expression
    data.
  • Group together data points that are close or
    similar to each other over all dimensions into
    clusters
  • Traditional distance-based clustering
  • Depends heavily on the distance (or similarity)
    measure,
  • e.g., Euclidean distance, Manhattan distance

7
Subspace clustering
  • DNA micro-array data is typically
    high-dimensional
  • Clustering over the global set of attributes
    often fails to extract all meaningful clusters
  • Objects tend to exhibit strong similarity only
    over an (unknown) subset of the dimensions.
  • Subspace clustering
  • Group together data points that are close or
    similar to each other over a subset of the
    dimensions into clusters
  • Discovery of clusters that are embedded in
    certain subspaces of high-dimensional data, such
    as gene expression data

8
Pattern-based subspace clustering
  • A special type of subspace clustering that uses
    pattern similarity as a measure of object
    distances.

9
Pattern-based subspace clustering
[Figure: raw data, 3 rows and 10 columns]
10
Pattern-based subspace clustering
  • In columns b, c, f, h, i:
  • The expression values of the rows follow the same
    rise-and-fall pattern
  • The 3 rows form a cluster in the subspace
    {b, c, f, h, i}.
  • Traditional distance functions are sometimes
    inadequate in capturing correlations among rows.
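
A minimal sketch of why distance alone can miss such a pattern. The two rows below are hypothetical values, not the data from the slide's figure; the second row is the first shifted up by 100, so both rise and fall together while staying far apart under both distance functions.

```python
import math

# Hypothetical expression values for two genes over five samples.
gene_a = [10.0, 30.0, 20.0, 50.0, 40.0]
gene_b = [110.0, 130.0, 120.0, 150.0, 140.0]

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length rows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (L1) distance between two equal-length rows."""
    return sum(abs(a - b) for a, b in zip(x, y))

def column_order(row):
    """Column indices sorted by ascending value in the row."""
    return sorted(range(len(row)), key=lambda j: row[j])

print(euclidean(gene_a, gene_b))   # large distance
print(manhattan(gene_a, gene_b))   # large distance
print(column_order(gene_a) == column_order(gene_b))  # same column ordering
```

Both distances are large, yet the two rows induce the same ordering of the columns, which is the kind of correlation pattern-based clustering looks for.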

11
Pattern-based subspace clustering
  • Rearrange the columns so that the expression
    values are listed in ascending order
  • This forms the column sequence <f, c, b, i, h>
  • All rows are increasing under the new column
    sequence
  • <f, c, b, i, h> is referred to as an
    order-preserving pattern
  • The length of a pattern is its number of columns
  • A row is a supporting row of a pattern if its
    values are in increasing order with respect to
    the column sequence
  • The support of a pattern is its number of
    supporting rows
  • An order-preserving pattern together with its set
    of supporting rows forms an Order-Preserving
    Submatrix, or OPSM.
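
The definitions above can be sketched in a few lines. The matrix is hypothetical (the slide's raw data is not reproduced in this transcript), and the names `DATA`, `COLUMNS`, and `supporting_rows` are illustrative only.

```python
# Hypothetical expression matrix: 4 rows, 10 columns named a..j.
COLUMNS = "abcdefghij"
DATA = [
    [5, 30, 20, 9, 1, 10, 7, 50, 40, 3],
    [8, 35, 25, 2, 6, 15, 90, 60, 45, 1],
    [70, 33, 22, 0, 4, 11, 2, 80, 44, 9],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
]

def supporting_rows(data, columns, pattern):
    """Return indices of rows whose values strictly increase along `pattern`."""
    index = {name: j for j, name in enumerate(columns)}
    cols = [index[name] for name in pattern]
    result = []
    for i, row in enumerate(data):
        values = [row[j] for j in cols]
        # A supporting row is strictly increasing along the column sequence.
        if all(x < y for x, y in zip(values, values[1:])):
            result.append(i)
    return result

rows = supporting_rows(DATA, COLUMNS, ["f", "c", "b", "i", "h"])
print(rows, len(rows))  # supporting rows and the support of the pattern
```

The pattern together with the returned row indices is one OPSM of this toy matrix; its support is the length of the returned list.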

12
Challenges
  • Computationally challenging problem
  • Complexity lies in the requirement of
    simultaneously determining both cluster members
    and relevant dimensions.
  • Number of potential order-preserving patterns
    grows exponentially with respect to the number of
    attributes.
  • A dataset D with n attributes has
    $\sum_{i} n!/(n-i)!$ potential order-preserving
    patterns, where i ranges over the pattern lengths
  • There could be tens to hundreds of attributes in
    DNA micro-array datasets
  • The number of potential OPSMs is huge
  • If a set of rows R and a set of columns C form a
    valid OPSM, then any subset of those rows plus
    any subset of those columns forms a valid OPSM
    too
  • FOCUS
  • Mine ALL maximal OPSMs only

13
Related Work
  • OPSM model, Ben-Dor et al., RECOMB 2002 [3]
  • Proposed a partial model that returns k
    good-quality OPSMs

14
Related Work
  • δ-pCluster, Wang et al., SIGMOD 2002 [11]
  • Can only discover shift patterns and scaling
    patterns

15
Formal description of the OPSM problem
  • Consider a gene-expression dataset D, represented
    as a matrix.
  • We use O and C to denote the set of rows and
    columns in D, respectively.
  • We use $d_{i,j}$ to denote the entry of D in row i
    and column j.

16
Formal description of the OPSM problem
  • A cluster S is a submatrix of D formed by a
    subset of $n_S$ ($\geq 2$) rows and a subset of
    $m_S$ ($\geq 2$) columns of D.
  • Rows and columns in S need not be contiguous in
    D. The rows in S are referenced by their row
    indices in D, each of which is a distinct integer
    in $\{1, 2, \ldots, n_D\}$. The set of row indices
    of S is denoted as $R_S$.
  • Columns in S are similarly referenced. The set of
    columns in S is denoted by $C_S$.

17
Formal description of the OPSM problem
  • The columns in S are enclosed in curly brackets,
    e.g., $C_S = \{c_1, c_2, \ldots, c_{m_S}\}$. A
    sequence $C_S$ of the columns in S is enclosed in
    angled brackets, e.g.,
    $C_S = \langle c_1, c_2, \ldots, c_{m_S} \rangle$.
    The columns in a sequence are totally ordered.
  • For the basic OPSM problem, a cluster is a set of
    rows and a set of columns such that entries in
    every row are increasing w.r.t. a particular
    column sequence.
  • A cluster S is thus written as $S = (R_S, C_S)$.

18
Problem Definition
  • Definition 1: A cluster S is an OPSM if there
    exists a sequence of columns such that, in the
    cluster S, $s_{i,j} < s_{i,j+1}$ for all
    $i \in \{1, 2, \ldots, n_S\}$ and all
    $j \in \{1, 2, \ldots, m_S - 1\}$.
  • A cluster S is a subcluster of a cluster S' if
    $R_S \subseteq R_{S'}$ and $C_S \subseteq C_{S'}$.
    A cluster S is a proper subcluster of a cluster S'
    if S is a subcluster of S' and either
    $R_S \subset R_{S'}$ or $C_S \subset C_{S'}$.
  • Definition 2: An OPSM is a maximal OPSM if it is
    not a proper subcluster of any OPSM.
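
A rough sketch of the subcluster and maximality definitions, assuming each OPSM is stored as a pair (frozenset of row indices, tuple of column names). That representation and the function names are illustrative, and the quadratic filter is only meant to mirror Definition 2, not the talk's algorithm.

```python
def is_subcluster(s, t):
    """True if the rows and columns of s are subsets of those of t."""
    rows_s, cols_s = s
    rows_t, cols_t = t
    return rows_s <= rows_t and set(cols_s) <= set(cols_t)

def is_proper_subcluster(s, t):
    """Subcluster where the row set or the column set is strictly smaller."""
    return is_subcluster(s, t) and (
        s[0] != t[0] or set(s[1]) != set(t[1])
    )

def maximal_only(opsms):
    """Keep the OPSMs that are not a proper subcluster of another one."""
    return [
        s for s in opsms
        if not any(is_proper_subcluster(s, t) for t in opsms)
    ]

# Hypothetical OPSMs, assumed to be valid for some dataset.
big = (frozenset({1, 2, 3}), ("f", "c", "b"))
small = (frozenset({1, 2}), ("f", "c"))
other = (frozenset({1, 2, 4}), ("c", "b"))
print(maximal_only([big, small, other]))
```

Here `small` is dropped because it is contained in `big`, while `other` survives because row 4 is not in `big`.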

19
Problem Definition
  • (Maximal size-constrained OPSM problem) Given a
    data matrix D, a supporting row threshold
    $n_{min}$, and a column threshold $m_{min}$, find
    all maximal OPSMs S in D such that
    $n_S \geq n_{min}$ and $m_S \geq m_{min}$.

20
Algorithm
  • Devised an apriori-like algorithm to solve the
    problem
  • Invented a new data structure
  • Head-Tail Trees
  • Efficient processing of
  • identifying all pairs of column sequences of
    length k where the first k - 1 indices of the
    first cluster's column sequence equal the last k
    - 1 indices of the second cluster's column
    sequence, and
  • intersecting the two sets of row indices
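
The two operations above can be sketched with plain dictionaries. This is not the Head-Tail Tree structure from the talk, whose details are not in this transcript; it is a naive stand-in that produces the same pairs. The function name `join_level` and the example data are hypothetical.

```python
from collections import defaultdict

def join_level(clusters, n_min):
    """Join length-k column sequences into length-(k+1) candidates.

    `clusters` maps a column sequence (tuple of length k) to the set of
    row indices supporting it. Two sequences are joined when the first
    k-1 columns of one equal the last k-1 columns of the other.
    """
    by_head = defaultdict(list)  # first k-1 columns -> sequences
    for seq in clusters:
        by_head[seq[:-1]].append(seq)

    result = {}
    for second in clusters:
        # Sequences whose head equals the tail of `second`.
        for first in by_head.get(second[1:], []):
            merged = (second[0],) + first
            if len(set(merged)) < len(merged):
                continue  # a column would appear twice
            rows = clusters[first] & clusters[second]
            if len(rows) >= n_min:
                result[merged] = rows
    return result

# Hypothetical length-2 patterns with their supporting rows.
level2 = {
    ("f", "c"): {0, 1, 2},
    ("c", "b"): {0, 1, 2},
    ("f", "b"): {0, 1, 2},
    ("b", "i"): {0, 1},
}
level3 = join_level(level2, n_min=2)
print(level3)
```

A row supports the merged sequence exactly when it supports both inputs, so intersecting the row sets is sufficient. The sketch does not filter for maximality.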

21
Algorithm
Given a dataset D, $n_{min} = 2$ and $m_{min} = 2$
22
Algorithm
The head tree for clusters with 2 columns
The tail tree for clusters with 2 columns
23
Algorithm
The head tree for clusters with 3 columns
The tail tree for clusters with 3 columns
24
Algorithm
  • Property 1 (A priori property): A cluster S is an
    OPSM if and only if all proper subclusters of S
    are also OPSMs.
  • Property 2 (Transitivity): If
    $S_1 = (R_{S_1}, \langle x_1, \ldots, x_i, y_1, \ldots, y_j \rangle)$
    and
    $S_2 = (R_{S_2}, \langle y_1, \ldots, y_j, z_1, \ldots, z_k \rangle)$
    are two maximal OPSMs and $R_{S_1} \cap R_{S_2}$
    contains $n_{min}$ or more indices, then
    $S = (R_{S_1} \cap R_{S_2}, \langle x_1, \ldots, x_i, y_1, \ldots, y_j, z_1, \ldots, z_k \rangle)$
    is a maximal OPSM.

25
Experiments
  • Yeast Dataset
  • Used in Y. Cheng and G.M. Church. Biclustering of
    expression data. ISMB, 2000.
  • 2884 rows x 17 columns
  • 54 gene groups
  1. Pick a gene group and put all the genes of the
     gene group in a cluster
  2. Find all dimension pairs along which the
     expression values of all the genes move in the
     same direction
  3. Merge the results of step 2 to find OPSMs with
     more than 2 dimensions
  4. Repeat from step 1
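
The pair-finding step can be sketched as below, reading "move in the same direction" as "every gene has a smaller value in the first column than in the second". That reading, the function name `coherent_pairs`, and the toy gene group are assumptions.

```python
from itertools import permutations

def coherent_pairs(group):
    """Ordered column pairs (a, b) with row[a] < row[b] for every row."""
    n_cols = len(group[0])
    return [
        (a, b)
        for a, b in permutations(range(n_cols), 2)
        if all(row[a] < row[b] for row in group)
    ]

# Hypothetical gene group: two genes over three samples.
group = [
    [1.0, 3.0, 2.0],
    [10.0, 30.0, 20.0],
]
print(coherent_pairs(group))
```

Each returned pair is a length-2 order-preserving pattern supported by the whole group; longer patterns come from merging these pairs.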

26
Scalability with respect to number of columns
Number of rows = 1000, $n_{min} = 10$, $m_{min} = 2$
27
Scalability with respect to number of rows
Number of columns = 10, $n_{min} = 10$, $m_{min} = 2$
28
Future Work
  • Uncertainty and experimental errors exist in
    micro-array datasets
  • Mine error-tolerant OPSMs

29
References
  • [1] R. Agrawal and R. Srikant. Fast algorithms
    for mining association rules in large databases.
    VLDB, 1994.
  • [2] Y. Cheng and G.M. Church. Biclustering of
    expression data. ISMB, 2000.
  • [3] A. Ben-Dor et al. Discovering local structure
    in gene expression data: the order-preserving
    submatrix problem. RECOMB, 2002.
  • [4] J. P. et al. MaPle: A fast algorithm for
    maximal pattern-based clustering. ICDM, 2003.
  • [5] J. P. B. et al. Significance and statistical
    errors in the analysis of DNA microarray data.
    PNAS, 2002.
  • [6] Y. K. et al. Spectral biclustering of
    microarray cancer data: Co-clustering genes and
    conditions. Genome Research, 13(4):703-716, 2003.
  • [7] L. Lazzeroni and A. Owen. Plaid models for
    gene expression data. Statistica Sinica,
    12:61-86, 2002.
  • [8] J. Liu and W. Wang. OP-Cluster: Clustering by
    tendency in high dimensional space. ICDM, 2003.
  • [9] J. Liu, J. Yang, and W. Wang. Biclustering in
    gene expression data by tendency. CSB, 2004.
  • [10] H. Wang, F. Chu, W. Fan, P. Yu, and J. Pei.
    A fast algorithm for subspace clustering by
    pattern similarity. SSDBM, 2004.
  • [11] H. Wang, W. Wang, J. Yang, and P. S. Yu.
    Clustering by pattern similarity in large data
    sets. SIGMOD, 2002.
  • [12] J. Yang, H. Wang, W. Wang, and P. Yu.
    Enhanced biclustering on expression data. BIBE,
    2003.