Title: Mining Order Preserving Submatrices (OPSMs) from data with replicates
1. Mining Order Preserving Submatrices (OPSMs) from data with replicates
- Presenter: Chun-Kit Chui
- Supervisor: Dr. Ben Kao
2. Presentation Outline
- Conventional Order Preserving Submatrices
- Multiple-value matrix data model
- Mining OPSMs from the new data model
- Efficient methods: bounding techniques
- Experimental evaluation
3. Order Preserving Submatrix
- The Order Preserving Submatrix (OPSM) problem is a pattern-based subspace clustering problem that is usually applied to mining gene expression datasets.
- Gene expression dataset
- One of the goals of microarray data analysis is to group coexpressed genes into clusters.
- Coexpressed genes are genes that show similar changes of expression level (expression values going up or down) under some environmental stimuli (experimental conditions).
[Figure: gene expression data matrix. Rows: genes. Columns: experimental conditions (experimental settings). Each entry: the expression value of a gene under an experimental setting.]
4. Order Preserving Submatrix
[Figure: raw gene expression dataset and the plotted data matrix. No obvious patterns are observed.]
- A gene's expression level may vary substantially because of its sensitivity to the experimental settings.
- That is, the change of expression level in response to a change of experimental condition is often considered more meaningful than the actual value.
- In microarray experiments, the design of experimental conditions is often based on little knowledge of gene functions.
- That is, clustering algorithms should consider a subset of conditions that maximizes the similarity among a subset of genes.
5. Order Preserving Submatrix
[Figure: raw gene expression dataset and the plotted data matrix. No obvious patterns are observed.]
Consider a subset of the experimental conditions and a subset of the genes, and reorder the columns.
[Figure: the order preserving submatrix, plotted for the subset of genes over the reordered subset of columns (experimental conditions).]
The expression values of these genes change in the same way in response to the change of experimental condition: they are all increasing.
Identifying this submatrix (the subset of genes, the subset of conditions, and the ordering of the conditions) is particularly useful to biologists, e.g. for inferring gene regulatory networks.
6. Order Preserving Submatrix
- Given a data matrix M with n rows and m columns, an order preserving submatrix S consists of:
- A subset of rows R, e.g. R = {G1,G4,G5}.
- A subset of columns C, e.g. C = {C1,C2,C3,C4,C5,C6}.
- A column order constraint such that the entries of all rows in R are increasingly ordered, e.g. <C3,C5,C4,C2,C1,C6>.
- Mining OPSMs: find all OPSMs with |R| greater than or equal to a user-specified support threshold (frequent) and |C| greater than or equal to c.
[Figure: raw gene expression dataset with the order preserving submatrix highlighted.]
7. Mining OPSMs
[Figure: the raw data matrix, the transformed sequence dataset, and a sequential pattern corresponding to an OPSM.]
- Mining OPSMs can be reduced to a special case of sequential pattern mining.
- Transform the data matrix into a sequence dataset by sorting each row in ascending order and replacing the entries with the corresponding column labels.
- Each sequential pattern uniquely specifies an OPSM, with all the supporting sequences as the supporting rows.
- A row supports an OPSM if the order constraint of the OPSM is a subsequence of the transformed column sequence of the row.
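The transformation and the support test described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`to_column_sequence`, `is_subsequence`, `supporting_rows`) and the expression values in `matrix` are hypothetical and are not taken from the slides.

```python
def to_column_sequence(row):
    """Sort a row's entries in ascending order and return the column labels."""
    return [label for label, _ in sorted(row.items(), key=lambda kv: kv[1])]


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (not necessarily contiguously)."""
    remaining = iter(sequence)
    # `label in remaining` advances the iterator, so order is respected.
    return all(label in remaining for label in pattern)


def supporting_rows(matrix, pattern):
    """Rows whose transformed column sequence contains the column order constraint."""
    return {gene for gene, row in matrix.items()
            if is_subsequence(pattern, to_column_sequence(row))}


# Hypothetical single-valued expression matrix (one value per gene and condition).
matrix = {
    "G1": {"C1": 40, "C2": 30, "C3": 10, "C4": 20},
    "G2": {"C1": 10, "C2": 20, "C3": 30, "C4": 40},
    "G3": {"C1": 90, "C2": 70, "C3": 15, "C4": 50},
}

print(to_column_sequence(matrix["G1"]))               # ['C3', 'C4', 'C2', 'C1']
print(sorted(supporting_rows(matrix, ["C3", "C2", "C1"])))  # ['G1', 'G3']
```

In this toy matrix, G1 and G3 both support the OPSM <C3,C2,C1>, while G2 does not.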
8. Mining OPSMs: Two properties
[Figure: transformed sequence dataset.]
- Apriori property
- If the OPSM with column order constraint <a,b> is infrequent (it is supported by fewer rows than the user-specified support threshold), then every OPSM whose column order constraint is a supersequence of <a,b> (e.g. <a,b,c>, <a,b,c,d>, <c,a,b,d>) is also infrequent.
- E.g. the OPSM with column order constraint <C8,C3,C5,C4,C2> is supported by G1 only, so it is infrequent.
- Adding more constraints to an OPSM can only reduce the number of supporting sequences.
- Therefore the OPSM <C8,C3,C5,C4,C2,C1> must be infrequent.
- By the Apriori property, we can use an iterative method to prune the search space: mine the frequent size-k OPSMs, identify the infrequent size-(k+1) OPSMs and prune them, then continue to the next (k+1) iteration.
9. Mining OPSMs: Two properties
[Figure: transformed sequence dataset.]
- Transitivity
- Let OPSM O1 have column order constraint <x1,x2,...,xi,y1,y2,...,yj> with supporting rows R1, and let OPSM O2 have column order constraint <y1,y2,...,yj,z1,z2,...,zk> with supporting rows R2. Then the intersection of R1 and R2 is the set of supporting rows of the OPSM O3 with column order constraint <x1,x2,...,xi,y1,y2,...,yj,z1,z2,...,zk>.
- E.g. the OPSM <C3,C5,C7> is supported by G1 and G4, and the OPSM <C5,C7,C1> is supported by G1.
- Therefore the OPSM <C3,C5,C7,C1> is supported by G1.
- By the Transitivity property, we can obtain the supports of the size-(k+1) OPSMs from the frequent size-k OPSMs without rescanning the sequence dataset.
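The Transitivity property amounts to a set intersection. The sketch below replays the slide's example; the function name `join_by_transitivity` and its `overlap` parameter are hypothetical names introduced only for illustration, and Python sets stand in for the bit vectors or tid-lists.

```python
def join_by_transitivity(o1, rows1, o2, rows2, overlap):
    """Merge O1 = <x..., y...> and O2 = <y..., z...> sharing `overlap` columns.

    Returns the merged column order constraint and its supporting rows,
    obtained by intersecting the supporting rows of O1 and O2.
    """
    if overlap < 1 or o1[-overlap:] != o2[:overlap]:
        raise ValueError("the end of O1 must equal the start of O2")
    return o1 + o2[overlap:], rows1 & rows2


# Example from the slide: <C3,C5,C7> is supported by G1 and G4,
# and <C5,C7,C1> is supported by G1.
o3, rows3 = join_by_transitivity(
    ("C3", "C5", "C7"), {"G1", "G4"},
    ("C5", "C7", "C1"), {"G1"},
    overlap=2,
)
print(o3, rows3)  # ('C3', 'C5', 'C7', 'C1') {'G1'}
```

No scan of the sequence dataset is needed here, which is the point of the property in the single-valued setting.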
10. Mining OPSMs
[Figure: flowchart of the mining algorithm: size-2 OPSM candidates, the Subsequence function, and the OPSM-Gen procedure.]
Mining starts from size-2 OPSMs because size-1 OPSMs do not have any ordering.
The Subsequence function verifies the supporting rows of the candidate OPSMs.
OPSMs whose number of supporting rows is greater than or equal to the support threshold are frequent; they are passed to the OPSM-Gen procedure.
By the Transitivity property, we can obtain the supporting rows of the size-(k+1) candidates from the frequent size-k OPSMs, with no need to scan the dataset.
By the Apriori property, we only generate those size-(k+1) candidates all of whose proper subsequences are frequent.
The algorithm terminates when no more candidates are generated.
11. Mining OPSMs: Data Structure
[Figure: the Subsequence function, the OPSM-Gen procedure, the size-3 frequent OPSMs table, and the Head-tail tree.]
A Head-tail tree data structure was proposed to facilitate candidate generation in the OPSM-Gen procedure.
In the OPSM-Gen procedure, the size-k frequent OPSMs table stores the column order constraints (OPSMs) and the supporting rows (bit vectors or tid-lists) of the OPSMs.
To add an OPSM <a,b,c> to the Head-tail tree, we first follow the head of the OPSM (i.e. <a,b>) to traverse the tree and store the OPSM in the leaf node, marking that the leaf was reached by following the head (H) of the OPSM.
Then we follow the tail of the OPSM (i.e. <b,c>) to traverse the tree and store the OPSM in that leaf node, marking that the leaf was reached by following the tail (T) of the OPSM.
12. Mining OPSMs: Data Structure
[Figure: the Subsequence function, the OPSM-Gen procedure, the size-3 frequent OPSMs table, and the Head-tail tree.]
By the Transitivity property, to generate the size-4 candidates we can simply merge the tail OPSMs and the head OPSMs within each leaf node.
13. Mining OPSMs: Data Structure
[Figure: the Subsequence function, the OPSM-Gen procedure, the size-3 frequent OPSMs table, and the Head-tail tree.]
For example, Tail <c,a,b> and Head <a,b,d> can be merged to form a new size-4 OPSM <c,a,b,d>.
Following their pointers, we can find their supporting rows. By the Transitivity property, OPSM <c,a,b,d> is supported by rows 4 and 5 (the intersection of the two supporting-row vectors).
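The merge of tail and head OPSMs within a leaf can be sketched as follows. This is a simplified illustration, not the data structure from the slides: a flat dictionary keyed by the shared (k-1)-column sequence stands in for the Head-tail tree, Python sets stand in for bit vectors, and the supporting rows of the two size-3 OPSMs are hypothetical values chosen so that their intersection is rows 4 and 5.

```python
from collections import defaultdict


def generate_candidates(frequent):
    """Generate size-(k+1) candidates from frequent size-k OPSMs.

    `frequent` maps each OPSM (a tuple of column labels) to its set of
    supporting rows. Each OPSM is filed under its head (first k-1 columns)
    and under its tail (last k-1 columns).
    """
    leaves = defaultdict(lambda: {"H": [], "T": []})
    for opsm in frequent:
        leaves[opsm[:-1]]["H"].append(opsm)  # leaf reached by following the head
        leaves[opsm[1:]]["T"].append(opsm)   # leaf reached by following the tail

    candidates = {}
    for leaf in leaves.values():
        for tail_opsm in leaf["T"]:
            for head_opsm in leaf["H"]:
                merged = tail_opsm + head_opsm[-1:]
                if len(set(merged)) == len(merged):  # skip repeated columns
                    # Transitivity: intersect the supporting rows.
                    candidates[merged] = frequent[tail_opsm] & frequent[head_opsm]
    return candidates


# Hypothetical supporting rows for the two size-3 OPSMs of the example.
frequent = {
    ("c", "a", "b"): {3, 4, 5},
    ("a", "b", "d"): {4, 5, 6},
}
print(generate_candidates(frequent))  # {('c', 'a', 'b', 'd'): {4, 5}}
```

The sketch does not apply the support threshold or the Apriori check on the remaining subsequences; those filters would follow candidate generation.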
14. The replicated data model
15. Multiple-value matrix data model
- Recent research in microarray data analysis has shown that any single microarray output is subject to substantial variability. [Stefan Bleuler et al., Evo Workshops 2005; R. Coombes et al., Journal of Computational Biology 2002; G. C. Tseng et al., Nucleic Acids Research 2001; J. P. Brody et al., National Academy of Sciences 2002]
- The error of the expression values of the genes under an experimental condition can be large.
- Replication is strongly supported by biologists as a straightforward approach for improving the quality of inferences made from experimental studies. [T.-K. Jenssen et al., Nucleic Acids Research 2002; M.-L. T. Lee et al., PNAS 2000; J. Novak et al., Genomics 2002; R. Ramakrishnan et al., Nucleic Acids Research 2002]
16. Multiple-value matrix data model
- The practice of conducting repeated experiments is stressed in much of the literature on microarray studies.
- A study on the effect of repeated measures on the detection of differentially expressed genes reported that stable results are typically not obtained until at least five biological replicates have been used. [P. Pavlidis et al., Bioinformatics 2003]
- Another study on variability analysis of gene expression data suggests that at least three repeated experiments should be conducted instead of one. [M.-L. T. Lee et al., PNAS 2000]
- Therefore, it is necessary to consider the data output by the repeated experiments when analyzing gene expression data.
17. Multiple-value matrix data model
With repeated experiments, the data output by microarray experiments can be organized as a matrix in which each entry is a set of expression levels of a gene under an experimental condition.
For example, 3 repeated experiments are conducted under experimental condition (column) C1; the expression values of gene (row) G1 in the first, second and third repeated experiments (replicates) are 23, 24 and 22 respectively.
18. Multiple-value matrix data model
- Which OPSMs does G2 support?
Let's consider this set of replicates.
Since the expression values of G2 in columns <C6,C4,C1,C3,C5,C7,C8,C2> are increasingly ordered, we say that the OPSM with column order constraint <C6,C4,C1,C3,C5,C7,C8,C2> is one of the OPSMs possibly supported by G2.
Expression values: <25, 26, 27, 31, 36, 37, 40, 45>
Expression values: <25, 26, 27, 31, 36, 37, 41, 45>
Two enumerated column orderings deduced from G2 conform to the column order constraint of this OPSM.
19. Multiple-value matrix data model
- Which OPSMs does G2 support?
Six enumerated column orderings deduced from G2 conform to the column order constraint of this OPSM.
Expression values: <22, 27, 30, 31, 33, 36, 43, 45>
20. Scoring Model
Which OPSM, <C1,C2> or <C2,C1>, does G1 support?
[Figure: raw dataset, transformed sequence dataset, enumerated column orderings table, and subsequence matches (enumerated column ordering counts).]
As in conventional OPSM mining, we can transform the raw dataset into a sequence dataset.
From the raw dataset, we can enumerate all the possible column orderings and store them in the enumerated column orderings table. In this case, there are 16 enumerated column orderings in total.
We define the score given by a row (gene) to an OPSM as the fraction of all the enumerated column orderings that conform to the column order constraint of the OPSM.
The denominator of the score function can be calculated by multiplying the numbers of replicates of the columns involved, i.e. 4 × 4 = 16.
<C1,C2> matches 11 of the 16 enumerated column orderings, so the OPSM with column order constraint <C1,C2> scores 11/16 from G1.
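A brute-force version of this score can be written directly from the definition. The sketch below is illustrative only: the function name `score` is hypothetical, and the replicate values for G1 are invented so that <C1,C2> matches 11 of the 16 enumerated orderings as on the slide (the slide's actual values are in a figure that is not reproduced here). Ties between replicate values are not handled.

```python
from fractions import Fraction
from itertools import product


def score(row, pattern):
    """Fraction of the enumerated column orderings of `row` that conform to `pattern`.

    `row` maps each column label to the list of its replicate values.
    One ordering is enumerated per choice of one replicate from each column.
    """
    replicate_lists = [row[column] for column in pattern]
    total = 0
    matches = 0
    for values in product(*replicate_lists):
        total += 1
        # The ordering conforms if the chosen values are strictly increasing.
        if all(a < b for a, b in zip(values, values[1:])):
            matches += 1
    return Fraction(matches, total)


# Hypothetical replicate values (4 replicates per column).
g1 = {"C1": [2, 4, 10, 14], "C2": [6, 8, 12, 16]}

print(score(g1, ("C1", "C2")))  # 11/16
print(score(g1, ("C2", "C1")))  # 5/16
```

The enumeration grows as the product of the replicate counts, which is why the later slides look for bounds and compression instead of enumerating.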
21. Scoring Model
To mine OPSMs under the scoring model, each supporting row (gene) is associated with the number of subsequence matches of the column order constraint (OPSM).
[Figure: size-3 frequent OPSMs table.]
From the subsequence matches, we can calculate the score contributed by each row (gene) and the total score of the OPSM.
Here, we use the total score as the support measure of an OPSM. OPSMs with total scores over a user-specified threshold are regarded as frequent.
22. Mining OPSMs from multiple-value matrix
[Figure: raw dataset.] An example raw dataset with 3 conditions (columns), where each condition has 4 repeated experiments (replicates).
[Figure: transformed sequence dataset.] We transform the raw dataset into a sequence dataset by sorting the entries in ascending order and replacing the entries with their condition (column) IDs.
[Figure: subsequence matches.] From the row 1 sequence, we find that <C1,C2> has 11 subsequence matches in row 1. Similarly, <C2,C3> has 10 subsequence matches in row 1.
23. Mining OPSMs from multiple-value matrix
[Figure: raw dataset, transformed sequence dataset, subsequence matches, and size-2 frequent OPSMs table.]
24. Mining OPSMs from multiple-value matrix
In the OPSM-Gen procedure, OPSMs are organized in a Head-Tail tree data structure.
[Figure: raw dataset, transformed sequence dataset, subsequence matches, size-2 frequent OPSMs table, and the Head-Tail tree (root with nodes 1, 2 and 3), used by the Subsequence function and OPSM-Gen.]
25. Mining OPSMs from multiple-value matrix
[Figure: raw dataset, transformed sequence dataset, subsequence matches, size-2 frequent OPSMs table, and the Head-Tail tree (root with nodes 1, 2 and 3).]
By the Transitivity property, Tail <C1,C2> and Head <C2,C3> can be merged to form a size-3 OPSM.
Recall that in conventional OPSM mining, we can deduce the support of <C1,C2,C3> from the size-2 frequent OPSMs table by intersecting the supporting rows, so we do not need to rescan the dataset.
Question: can we deduce the subsequence matches (score) of <C1,C2,C3> in row 1 from the size-2 frequent OPSMs table?
26. Mining OPSMs from multiple-value matrix
[Figure: raw dataset, size-2 frequent OPSMs table, Head-Tail tree (root with nodes 1, 2 and 3), and enumerated ordering tables (size-2) for row 1.]
Essentially, obtaining the subsequence matches of <C1,C2,C3> is equivalent to performing a join on column C2 of the two enumerated ordering tables.
Question: can we materialize these tables to facilitate the join?
Since the join information cannot be deduced from the counts (subsequence matches), we cannot obtain the subsequence matches of <C1,C2,C3> without revisiting the sequence dataset.
27. Mining OPSMs from multiple-value matrix
[Figure: flowchart of the mining algorithm: size-2 OPSM candidates, the Subsequence function, frequent size-k OPSMs, OPSM-Gen, and size-(k+1) candidate OPSMs.]
Unlike conventional OPSM mining, we have to revisit the dataset to obtain the subsequence matches of the candidates.
Obtaining the subsequence matches of a candidate requires enumerating the column orderings, whose number is exponential in the size of the candidate OPSM. The same process has to be repeated for all rows.
The number of candidates explodes combinatorially.
Compress the sequence dataset to reduce the effort of obtaining the subsequence matches (tree traversal).
Organize the candidates in a prefix tree and verify the subsequence matches in a single scan over the dataset.
Reduce the number of candidates through bounding techniques.
28. Min upper bound
[Figure: size-2 frequent OPSMs table and Head-Tail tree (root with nodes 1, 2 and 3).]
- Motivating questions
- Assume every column has 4 replicates.
- We have 11 subsequences <C1,C2> in row 1 and there are 4 C3s, so the maximum possible number of subsequence matches of <C1,C2,C3> in row 1 is 11 × 4 = 44.
- We have 10 subsequences <C2,C3> in row 1 and there are 4 C1s, so the maximum possible number of subsequence matches of <C1,C2,C3> in row 1 is 10 × 4 = 40.
- Therefore, the upper bound of the possible subsequence matches of <C1,C2,C3> in row 1 is min(44, 40) = 40.
For the first estimate, we assume that all 4 C3s are on the right of the 11 subsequences <C1,C2>. Therefore we guess that the maximum possible number of subsequence matches of <C1,C2,C3> is 11 × 4 = 44.
The value 40 is an upper bound of the subsequence matches of <C1,C2,C3> in row 1. If we apply this bound to all the rows, we obtain an upper bound of the score of the OPSM.
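The Min upper bound is a one-line computation. The sketch below restates it with hypothetical parameter names; it assumes the tail count is multiplied by the replicates of the last column and the head count by the replicates of the first column, as in the 11 × 4 and 10 × 4 estimates.

```python
def min_upper_bound(tail_matches, head_matches, first_replicates, last_replicates):
    """Upper bound on the subsequence matches of the merged OPSM in one row.

    tail_matches: matches of the tail OPSM, e.g. <C1,C2>
    head_matches: matches of the head OPSM, e.g. <C2,C3>
    first_replicates: replicates of the first column, e.g. C1
    last_replicates: replicates of the last column, e.g. C3
    """
    from_tail = tail_matches * last_replicates   # every C3 follows every <C1,C2>
    from_head = head_matches * first_replicates  # every C1 precedes every <C2,C3>
    return min(from_tail, from_head)


print(min_upper_bound(11, 10, 4, 4))  # 40
```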
29. HT arrays
[Figure: raw dataset and transformed sequence dataset.]
Construct a T-array for the tail OPSM <C1,C2>. How many C2s are after the 1st C1? How many C2s are after the 2nd C1?
T-array of <C1,C2> (slots 1 to 4): 4, 4, 2, 1
30. HT arrays
[Figure: raw dataset and transformed sequence dataset.]
T-array of <C1,C2> (slots 1 to 4): 4, 4, 2, 1. How many C2s are after the 1st C1? How many C2s are after the 2nd C1?
H-array of <C2,C3> (slots 1 to 4): 4, 3, 2, 1. How many C3s are after the 1st C2? How many C3s are after the 2nd C2?
31. HT arrays
With these two arrays, we can deduce the subsequence matches of <C1,C2,C3>.
T-array of <C1,C2> (slots 1 to 4): 4, 4, 2, 1
H-array of <C2,C3> (slots 1 to 4): 4, 3, 2, 1
32. HT arrays
There are 4 C2s after the 1st C1.
There are 4 C3s after the 1st C2.
So we can conclude that there are 4 <C1,C2,C3> orderings formed by the 1st C1 and the 1st C2.
Subsequence matches of <C1,C2,C3> so far: 4
33. HT arrays
There are 4 C2s after the 1st C1.
There are 3 C3s after the 2nd C2.
Therefore we can conclude that there are 3 <C1,C2,C3> orderings formed by the 1st C1 and the 2nd C2.
Subsequence matches of <C1,C2,C3> so far: 4 + 3
34. HT arrays
There are 4 C2s after the 1st C1.
There are 2 C3s after the 3rd C2.
Subsequence matches of <C1,C2,C3> so far: 4 + 3 + 2
35. HT arrays
There are 4 C2s after the 1st C1.
There is 1 C3 after the 4th C2.
Subsequence matches of <C1,C2,C3> so far: 4 + 3 + 2 + 1
36. HT arrays
The same applies to the 2nd C1: there are 4 C2s after the 2nd C1.
Subsequence matches of <C1,C2,C3> so far: 4 + 3 + 2 + 1
37. HT arrays
The same applies to the 2nd C1: there are 4 C2s after the 2nd C1.
So we can sum all 4 entries of the H-array, which gives 10.
Subsequence matches of <C1,C2,C3> so far: (4 + 3 + 2 + 1) + 10
38. HT arrays
There are 2 C2s after the 3rd C1. Which slots of the H-array should we sum up?
Subsequence matches of <C1,C2,C3> so far: (4 + 3 + 2 + 1) + 10
39. HT arrays
There are 2 C2s after the 3rd C1. Which slots of the H-array should we sum up?
Since there are only 2 C2s after the 3rd C1, these 2 C2s must be the 3rd and 4th C2s. Otherwise, T-array[3] would not be 2.
Subsequence matches of <C1,C2,C3> so far: (4 + 3 + 2 + 1) + 10 + (2 + 1)
40. HT arrays
Finally, there is 1 C2 after the 4th C1.
Since there is only 1 C2 after the 4th C1, that C2 must be the 4th C2. Otherwise, T-array[4] would not be 1.
Subsequence matches of <C1,C2,C3>: (4 + 3 + 2 + 1) + 10 + (2 + 1) + 1 = 24
Finally, we can deduce that the number of subsequence matches of <C1,C2,C3> in row 1 is 24.
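The walk-through over the two arrays can be condensed into a short function. This is a sketch with a hypothetical function name; it relies on the observation from the slides that the C2s after a given C1 are always the last C2s of the row, so the matching H-array slots are the last ones.

```python
def matches_from_ht_arrays(t_array, h_array):
    """Subsequence matches of <C1,C2,C3> deduced from the two arrays.

    t_array[q]: number of C2s after the (q+1)-th C1
    h_array[p]: number of C3s after the (p+1)-th C2
    """
    total = 0
    for c2_after in t_array:
        # The C2s after this C1 are the last `c2_after` C2s in the row,
        # so sum the last `c2_after` slots of the H-array.
        total += sum(h_array[len(h_array) - c2_after:])
    return total


t_array = [4, 4, 2, 1]  # T-array of <C1,C2> from the slides
h_array = [4, 3, 2, 1]  # H-array of <C2,C3> from the slides
print(matches_from_ht_arrays(t_array, h_array))  # 24
```

The four terms of the sum are 10, 10, 3 and 1, matching the running totals on the slides.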
41. HT arrays
[Figure: size-2 frequent OPSMs table and Head-Tail tree (root with nodes 1, 2 and 3).]
T-array of <C1,C2> (slots 1 to 4): 4, 4, 2, 1
H-array of <C2,C3> (slots 1 to 4): 4, 3, 2, 1
Can we store the HT-arrays instead of the subsequence matches, so that we do not need to rescan the dataset in the Subsequence function procedure?
Generalized HT-arrays: a T-array for the tail <C1,C2,...,Cx-1> and an H-array for the head <C2,...,Cx-1,Cx>.
However, the number of slots of the H-array is exponential in the number of columns in the OPSM.
42. HT upper bound
[Figure: size-2 frequent OPSMs table and Head-Tail tree (root with nodes 1, 2 and 3).]
Motivation: we can obtain a bound on the subsequence matches of <C1,C2,C3> by guessing the HT-arrays from the subsequence matches of the tail and head OPSMs.
To obtain an upper bound of the subsequence matches of <C1,C2,C3>, we guess the T-array such that the number of subsequence matches of <C1,C2,C3> is maximized. This can be done by assigning the C1s as far to the left of the column sequence as possible (push left).
Guessed T-array of <C1,C2> (slots 1 to 4, push left): 4, 4, 3, 0
The first slot indicates how many C2s are after the 1st C1, so its value cannot be larger than the number of replicates of C2.
43. HT upper bound
Guessed T-array of <C1,C2> (slots 1 to 4, push left): 4, 4, 3, 0
For the H-array of <C2,C3>, we assign the C3s as far to the right of the column sequence as possible (push right).
44. HT upper bound
Guessed T-array of <C1,C2> (slots 1 to 4, push left): 4, 4, 3, 0
Guessed H-array of <C2,C3> (slots 1 to 4, push right): 3, 3, 2, 2
The H-array cannot have a zero slot followed by a non-zero slot: if there is no C3 after the 1st C2, then there cannot be any C3 after the 2nd, 3rd or 4th C2. Therefore, there is a constraint when assigning values to the HT-arrays: Array[x] >= Array[x+1].
45. HT upper bound
Guessed T-array of <C1,C2> (push left): 4, 4, 3, 0. Guessed H-array of <C2,C3> (push right): 3, 3, 2, 2.
Following the previous algorithm, we can obtain the upper bound of the subsequence matches of <C1,C2,C3> from the two guessed HT-arrays.
46. HT upper bound
Upper bound of the subsequence matches of <C1,C2,C3> so far: (3 + 3 + 2 + 2)
47. HT upper bound
Upper bound of the subsequence matches of <C1,C2,C3> so far: (3 + 3 + 2 + 2) + (3 + 3 + 2 + 2)
48. HT upper bound
Upper bound of the subsequence matches of <C1,C2,C3> so far: (3 + 3 + 2 + 2) + (3 + 3 + 2 + 2) + (3 + 2 + 2)
49. HT upper bound
The 4th slot of the guessed T-array is 0, so the 4th C1 contributes nothing.
Upper bound of the subsequence matches of <C1,C2,C3> so far: (3 + 3 + 2 + 2) + (3 + 3 + 2 + 2) + (3 + 2 + 2)
50. HT upper bound
Guessed T-array of <C1,C2> (push left): 4, 4, 3, 0. Guessed H-array of <C2,C3> (push right): 3, 3, 2, 2.
Upper bound of the subsequence matches of <C1,C2,C3>: (3 + 3 + 2 + 2) + (3 + 3 + 2 + 2) + (3 + 2 + 2) = 27
The upper bound of the subsequence matches of <C1,C2,C3> in row 1 is 27.
51. HT bounds
[Figure: size-2 frequent OPSMs table and Head-Tail tree (root with nodes 1, 2 and 3).]
Upper bound = 27: T-array of <C1,C2> pushed left: 4, 4, 3, 0. H-array of <C2,C3> pushed right: 3, 3, 2, 2.
Lower bound: T-array of <C1,C2> pushed right: 3, 3, 3, 2. H-array of <C2,C3> pushed left: 4, 4, 2, 0.
52. HT bounds
Lower bound: T-array of <C1,C2> pushed right: 3, 3, 3, 2. H-array of <C2,C3> pushed left: 4, 4, 2, 0.
Lower bound of the subsequence matches of <C1,C2,C3> so far: 6
53. HT bounds
Lower bound of the subsequence matches of <C1,C2,C3> so far: 6 + 6
54. HT bounds
Lower bound of the subsequence matches of <C1,C2,C3> so far: 6 + 6 + 6
55. HT bounds
Lower bound of the subsequence matches of <C1,C2,C3> so far: 6 + 6 + 6 + 2
56. HT bounds
Upper bound = 27: T-array of <C1,C2> pushed left: 4, 4, 3, 0. H-array of <C2,C3> pushed right: 3, 3, 2, 2.
Lower bound: T-array of <C1,C2> pushed right: 3, 3, 3, 2. H-array of <C2,C3> pushed left: 4, 4, 2, 0.
Lower bound of the subsequence matches of <C1,C2,C3>: 6 + 6 + 6 + 2 = 20
The lower bound of the subsequence matches of <C1,C2,C3> is 20.
57. Comparisons
Exact HT-arrays = 24: T-array of <C1,C2>: 4, 4, 2, 1. H-array of <C2,C3>: 4, 3, 2, 1.
HT upper bound = 27: T-array pushed left: 4, 4, 3, 0. H-array pushed right: 3, 3, 2, 2.
HT lower bound = 20: T-array pushed right: 3, 3, 3, 2. H-array pushed left: 4, 4, 2, 0.
Min upper bound = 40.
Recall that the HT-array method returns the exact number of subsequence matches of <C1,C2,C3>, which is 24. However, it is not feasible to keep the HT-arrays for each candidate.
The Min upper bound approach returns 40 as the upper bound of the subsequence matches of <C1,C2,C3>. Compared with it, the HT upper bound (27) is much tighter.
58. Generalized HT upper bound
Tail sequence: <C1,C2,...,Cx-1>
Head sequence: <C2,...,Cx-1,Cx>
Middle sequence: <C2,...,Cx-1>
New (generated) sequence: <C1,C2,...,Cx-1,Cx>
Assume the number of replicates of column Cy is r(Cy).
Running example (upper bound 27): T-array of <C1,C2> pushed left: 4, 4, 3, 0. H-array of <C2,C3> pushed right: 3, 3, 2, 2.
59. Generalized HT upper bound
The q-th slot of the T-array of <C1,C2,...,Cx-1> represents the number of Middle sequences after the q-th C1.
Therefore, the number of slots of the T-array is equal to the number of replicates of C1.
The maximum possible value of each slot is equal to the maximum possible number of subsequence matches of the Middle sequence.
60. Generalized HT upper bound
The q-th slot of the H-array of <C2,...,Cx-1,Cx> represents the number of Cxs after the q-th Middle sequence.
Therefore, the number of slots of the H-array is equal to the maximum possible number of subsequence matches of the Middle sequence.
The maximum possible value of each slot of the H-array is equal to the number of replicates of Cx.
61. Generalized HT upper bound
We notice that the push-left assignment always yields a T-array in which the first k slots are fully filled and all the slots after the (k+1)-th slot are zero.
Let T be the number of subsequence matches of the Tail sequence (i.e. 11 in the example).
62. Generalized HT upper bound
Let T be the number of subsequence matches of the Tail sequence (i.e. 11 in the example), and let m be the maximum possible value of a T-array slot (i.e. 4 in the example).
Rule 1: the first floor(T / m) slots have value m.
E.g. the first 2 slots have value 4.
63. Generalized HT upper bound
Let T be the number of subsequence matches of the Tail sequence (i.e. 11 in the example), and let m be the maximum possible value of a T-array slot (i.e. 4 in the example).
Rule 1: the first floor(T / m) slots have value m.
Rule 2: the (floor(T / m) + 1)-th slot has value T mod m.
E.g. the 3rd slot has value 11 mod 4 = 3.
64. Generalized HT upper bound
Let T be the number of subsequence matches of the Tail sequence (i.e. 11 in the example), and let m be the maximum possible value of a T-array slot (i.e. 4 in the example).
Rule 1: the first floor(T / m) slots have value m.
Rule 2: the (floor(T / m) + 1)-th slot has value T mod m.
Rule 3: the other slots have value zero.
65. Generalized HT upper bound
Similar to the T-array, the H-array can be divided into two partitions; the values in the first partition are larger than the values in the second partition by 1.
Let H be the number of subsequence matches of the Head sequence (i.e. 10 in the example), and let n be the number of slots of the H-array (i.e. 4 in the example).
Rule 1: the first (H mod n) slots have value floor(H / n) + 1.
Rule 2: the other slots have value floor(H / n).
In the example this gives the H-array 3, 3, 2, 2.
66Generalized HT upper bound
T-array for <C1, C2, ..., Cx-1>
Let T be the number of subsequence matches for the
Tail sequence, and let m be the maximum possible
value of a slot.
Rule 1: The first floor(T / m) slots hold the
value m.
Rule 2: The (floor(T / m) + 1)-th slot holds the
value T mod m.
Rule 3: The other slots hold the value zero.
H-array for <C2, ..., Cx-1, Cx>
Let H be the number of subsequence matches for the
Head sequence, and let n be the number of slots.
Rule 1: The first (H mod n) slots hold the value
floor(H / n) + 1.
Rule 2: The other slots hold the value
floor(H / n).
With these rules, we can deduce a formula to
calculate the upper bound without constructing
these arrays.
A similar method can be applied to the HT lower
bound, so we do not need to materialize any of
the HT-arrays.
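The slot rules above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `t_array` and `h_array`, and the parameters `m` (maximum possible value of a slot) and `n` (number of slots), are hypothetical labels, and the rules are read as "fill from the left up to m" for the T-array and "spread as evenly as possible" for the H-array, which reproduces the example arrays 4, 4, 3, 0 and 3, 3, 2, 2.

```python
def t_array(T, m, n):
    """Push-left assignment (assumed reading of Rules 1-3 for the T-array).

    T: subsequence matches of the Tail sequence
    m: maximum possible value of one slot (hypothetical parameter name)
    n: number of slots (hypothetical parameter name)
    """
    full, rest = divmod(T, m)
    if full > n or (full == n and rest > 0):
        raise ValueError("T does not fit into n slots of capacity m")
    slots = [m] * full                 # Rule 1: fully filled slots
    if full < n:
        slots.append(rest)             # Rule 2: the next slot holds T mod m
    slots += [0] * (n - len(slots))    # Rule 3: remaining slots are zero
    return slots


def h_array(H, n):
    """Assumed reading of Rules 1-2 for the H-array.

    The first (H mod n) slots are larger than the rest by exactly 1.
    """
    low, extra = divmod(H, n)
    return [low + 1] * extra + [low] * (n - extra)


print(t_array(11, 4, 4))  # example from the slides: T = 11
print(h_array(10, 4))     # example from the slides: H = 10
```

Because both arrays are fully determined by T, H, m and n, the bound can in principle be computed from these four numbers alone, which is the point of not materializing the arrays.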
67Compression method
Given a column sequence, we would like to find
the subsequence matches of <C1, C2> in the column
sequence of row 1.
Transformed Sequence Dataset
The naive method is to enumerate all the size-2
subsequences and count the occurrences of
<C1, C2>, which requires enumerating 16 column
orderings.
Compressed Sequence Dataset
Subsequence matches of <C1, C2> in row 1: 3 × 1
There are 3 C1s on the left of 1 C2, therefore
there are 3 × 1 = 3 <C1, C2>s.
68Compression method
Given a column sequence, we would like to find
the subsequence matches of <C1, C2> in the column
sequence of row 1.
Transformed Sequence Dataset
The naive method is to enumerate all the size-2
subsequences and count the occurrences of
<C1, C2>, which requires enumerating 16 column
orderings.
Compressed Sequence Dataset
Subsequence matches of <C1, C2> in row 1:
3 × 1, 3 × 3, 1 × 3
There are 3 C1s on the left of 1 C2, therefore
there are 3 × 1 = 3 <C1, C2>s.
There are 3 C1s on the left of 3 C2s, therefore
there are 3 × 3 = 9 <C1, C2>s.
There is 1 C1 on the left of 3 C2s, therefore
there are 1 × 3 = 3 <C1, C2>s.
69Compression method
Given a column sequence, we would like to find
the subsequence matches of <C1, C2> in the column
sequence of row 1.
Transformed Sequence Dataset
The naive method is to enumerate all the size-2
subsequences and count the occurrences of
<C1, C2>, which requires enumerating 16 column
orderings.
Compressed Sequence Dataset
Subsequence matches of <C1, C2> in row 1:
3 × 1 + 3 × 3 + 1 × 3 = 15
There are 15 <C1, C2>s in total. Obtaining the
subsequence matches this way only requires
enumerating 3 column orderings.
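A small sketch of the counting idea, in Python. The row sequence used here is an assumption: it is one ordering of 4 C1 replicates and 4 C2 replicates that is consistent with the counts quoted above (3 × 1, 3 × 3, 1 × 3), not necessarily the one shown on the slide. The function names are hypothetical.

```python
from itertools import groupby


def count_naive(seq, a, b):
    """Enumerate every (a, b) pair of positions with a before b."""
    return sum(
        1
        for i, x in enumerate(seq) if x == a
        for y in seq[i + 1:] if y == b
    )


def compress(seq):
    """Run-length encode the sequence: consecutive equal columns become (column, count)."""
    return [(col, len(list(run))) for col, run in groupby(seq)]


def count_compressed(runs, a, b):
    """Enumerate only pairs of runs; each pair contributes count_a * count_b."""
    products = [
        na * nb
        for i, (ca, na) in enumerate(runs) if ca == a
        for (cb, nb) in runs[i + 1:] if cb == b
    ]
    return sum(products), products


# Assumed column sequence of row 1 (hypothetical, chosen to match the counts).
row1 = ["C1"] * 3 + ["C2"] + ["C1"] + ["C2"] * 3

print(count_naive(row1, "C1", "C2"))                 # checks 4 x 4 = 16 orderings
print(count_compressed(compress(row1), "C1", "C2"))  # checks 3 run pairs
```

Both functions return the same total; the compressed version only visits one product per pair of runs instead of one test per pair of replicates.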
70Experimental Evaluation
71Experimental settings
- C programming language
- Machine
- CPU 2.6 GHz
- Memory 1 GB
- Fedora
- Dataset
- Real dataset: Yeast galactose dataset
- Subset of 205 genes (rows) of the yeast
galactose data
- 20 experimental conditions (columns)
- 4 biological replicates per condition
- Publicly available:
http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html
- Synthetic dataset
- Replicate simulation: Generate normal
distributions according to the means and
variances of the replicates in the real dataset,
and randomly generate a new replicate value
according to the distribution.
- Column simulation: Generate a new column by
randomly selecting an experimental condition in
the real dataset and perturbing the mean and
variance.
- Row simulation: Generate normal distributions
according to the means and variances of the
replicates in the real dataset, and generate a
new row according to the distributions.
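The replicate and row simulations described above might look like the following Python sketch. It is an illustration only: the function names are hypothetical, the sample standard deviation is an assumed choice (the slides only say "means and variances"), and the column simulation is left out because the perturbation scheme is not specified.

```python
import random
import statistics


def simulate_replicate(replicates, rng):
    """Draw one new replicate value from a normal distribution fitted
    to the existing replicates of one gene under one condition."""
    mean = statistics.mean(replicates)
    sd = statistics.stdev(replicates)  # assumption: sample standard deviation
    return rng.gauss(mean, sd)


def simulate_row(row, rng):
    """Generate a new row: for each condition, draw as many values as the
    real row has replicates, from that condition's fitted distribution."""
    return [[simulate_replicate(reps, rng) for _ in reps] for reps in row]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
real_row = [[1.0, 1.2, 0.9, 1.1], [3.0, 3.1, 2.9, 3.2]]  # made-up values
print(simulate_row(real_row, rng))
```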
72Execution time per iteration
For the HT bounds, we use the HT upper bound to
identify infrequent candidates, which can be
pruned, and we use the HT lower bound to identify
large OPSMs. We do not verify the subsequence
matches for those large OPSMs.
Figure: The number of candidates generated in
each iteration using different bounding
techniques
The HT upper bound technique reduces the number
of candidates by more than half in all
iterations.
The Brute-force approach mines the OPSMs without
using any bounding techniques. All the algorithms
start from mining size-2 OPSMs.
73Execution time per iteration
For the HT bounds, we use the HT upper bound to
identify infrequent candidates, which can be
pruned, and we use the HT lower bound to identify
large OPSMs. We do not verify the subsequence
matches for those large OPSMs.
Figure: The number of candidates generated in
each iteration using different bounding
techniques
Figure: Execution time in each iteration using
different bounding techniques
The HT bounds + compression approach uses the HT
upper and lower bounds to reduce the candidate
set, and uses the compression method to reduce
the cost of obtaining the subsequence matches of
the remaining candidates.
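How the two bounds are used on a candidate can be sketched as below. This is a hedged illustration, not the talk's implementation: the function name, the string labels, and the exact comparison operators (strictly below the threshold to prune, at or above it to accept) are assumptions.

```python
def classify_candidate(upper_bound, lower_bound, min_support):
    """Decide what to do with a candidate using only its HT bounds.

    upper_bound / lower_bound: HT bounds on the candidate's support
    min_support: the support threshold (hypothetical parameter name)
    """
    if upper_bound < min_support:
        return "prune"    # cannot reach the threshold: infrequent
    if lower_bound >= min_support:
        return "accept"   # guaranteed large: skip verification
    return "verify"       # bounds are inconclusive: count the matches


# Made-up numbers, only to show the three outcomes.
print(classify_candidate(upper_bound=27, lower_bound=5, min_support=30))
print(classify_candidate(upper_bound=60, lower_bound=35, min_support=30))
print(classify_candidate(upper_bound=60, lower_bound=10, min_support=30))
```

Only the candidates that fall into the last case pay the cost of obtaining subsequence matches, which is where the compression method applies.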
74Vary the support threshold
The saving from the lower bound increases as the
support threshold decreases. The reason is that,
as the support requirement decreases, the
difference between the supports of large
candidates and the support requirement increases,
so those large OPSMs become easier to identify.
Scalability test on support threshold
The HT bounds + compression method achieves the
best execution time saving.
Figure: Execution time saving (%) compared with
the Brute-force approach
The saving from the HT upper bound decreases as
the support threshold decreases. This is because
it is harder for an upper bound to be less than
the support requirement (the pruning condition)
as the support requirement decreases.
75Vary the columns
Scalability test on columns
The pruning power of the bounding techniques is
largely independent of the number of columns in
the dataset.
Figure: Execution time saving (%) compared with
the Brute-force approach
Essentially, an increase in columns increases the
number of candidates generated but NOT the cost
of obtaining the subsequence matches for each
candidate.
76Vary the replicates
Scalability test on replicates
Figure: Execution time saving (%) compared with
the Brute-force approach
The saving from both the Min upper bound and the
HT upper bound decreases as the number of
replicates increases. Why?
77Vary the replicates
The number of slots of the T- and H-arrays is
determined by the replicates. Essentially, the
larger the arrays, the looser the bounds.
HT upper bound
T-array for <C1, C2, ..., Cx-1>
(slots 1, 2, 3, ...; maximum possible value
marked)
H-array for <C2, ..., Cx-1, Cx>
(slots 1, 2, 3, ...; maximum possible value
marked)
Figure: Execution time saving (%) compared with
the Brute-force approach
The saving from both the Min upper bound and the
HT upper bound decreases as the number of
replicates increases. Why?
78Vary the replicates
The number of slots of the T- and H-arrays is
determined by the replicates. Essentially, the
larger the arrays, the looser the bounds.
HT upper bound
T-array for <C1, C2, ..., Cx-1>
(slots 1, 2, 3, ...; maximum possible value
marked)
H-array for <C2, ..., Cx-1, Cx>
(slots 1, 2, 3, ...; maximum possible value
marked)
Figure: Execution time saving (%) compared with
the Brute-force approach
In the Min upper bound, we multiply the number of
replicates of C3 by the number of subsequence
matches of <C1, C2>. The tightness of the Min
bound is therefore also determined by the
replicates.
The saving from both the Min upper bound and the
HT upper bound decreases as the number of
replicates increases. Why?
79Vary the replicates
The saving from the HT bounds + compression
method increases as the number of replicates
increases. This is mainly due to the saving from
compressing the sequences, which reduces the
number of enumerated sequences.
Scalability test on replicates
Figure: Execution time saving (%) compared with
the Brute-force approach
The saving from both the Min upper bound and the
HT upper bound decreases as the number of
replicates increases. Why?
80Vary the rows
Scalability test on rows
81Conclusion
- A single microarray output is subject to
substantial variability; replication is the
common practice to address this issue.
- We have proposed a scoring model to mine
Order Preserving Submatrices from gene expression
datasets with repeated measurements.
- Mining OPSMs under the scoring model incurs a
heavy computational cost (obtaining subsequence
matches).
- An HT bounding technique and a compression
method are proposed to mine the OPSMs
efficiently.
- Experimental results show that the HT bounding
technique + compression method achieves the best
CPU cost saving.
82Things not covered in this talk
- Biological evaluation of cluster quality:
oPOSSIUM, Gene Ontology, ARI
- Efficient method for the subsequence function
- Prefix tree to organize the candidates and
verify the supports through a single dataset
scan
- Compression of the sequence dataset to reduce
the prefix tree traversal
- Bounding techniques
- Application in other areas: Collaborative
Filtering
- Visualization of OPSMs
83End