Title: Biclustering in Gene Expression Datasets
1 Biclustering in Gene Expression Datasets
Data mining for related genes in microarray datasets
Kenneth Bryan, Machine Learning Group
Supervisor: Pádraig Cunningham
2 Background Brief: Why study gene expression?
- The goal of biology is to fully understand how living things function. The behavior and structure of an organism are governed principally by its genes.
- If we understand the function of each gene, this will lead to a fuller understanding of the whole organism (reductionism).
- The classic way to discover the function of a gene was to delete that gene and observe the development of the organism.
[Figure: delete Gene X to create a mutant, observe the organism's development, and conclude that Gene X is involved in limb development.]
Note: Simpler organisms such as bacteria (Escherichia coli) and yeast (Saccharomyces cerevisiae) are more often used in these experiments.
- However, this one-gene, one-experiment approach is too slow to analyze the vast amount of genomic information we have today (HGSP: 30,000 genes).
- Microarray Gene Expression Analysis is a recently developed experiment that enables the study of the activity of many genes simultaneously.
3 Microarray Gene Expression Analysis
In Microarray Gene Expression Analysis, the expression (activity) of genes is monitored over various experimental conditions. Many genes have unknown function. Analyzing the expression of genes, together with the nature of the conditions under which they are expressed, allows the identification of groups of related genes and their function.
[Figure: groups of related genes labeled Glucose Breakdown, Amino Acid Metabolism, Cell Growth, and Osmoregulation.]
Functions can then be assigned to these groups by examining the conditions involved (temperature, starvation, high salt, disease, etc.).
4 Data Mining in Microarrays
However, microarrays can contain thousands of genes and hundreds of conditions. Data-mining techniques such as Cluster Analysis are typically used to analyze these large gene expression datasets.
Cluster Analysis is an unsupervised grouping technique used to group similar objects (in this case genes) into disjoint (non-overlapping) sets based on the similarity of their attributes (conditions).
[Figure: graph of gene expression versus conditions; clustering groups the rows into Cluster A, Cluster B, and Cluster C.]
Similar rows are grouped together into unique clusters. Each cluster may represent a group of functionally related genes (a biological network).
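As a rough illustration of clustering over all conditions, the following Python sketch groups the rows of a small, made-up expression matrix with plain k-means. The matrix values, the function names `farthest_first` and `kmeans_rows`, and the choice of k-means itself are assumptions for illustration; the slides do not say which clustering algorithm was used.

```python
import numpy as np

def farthest_first(data, k):
    """Pick k well-separated rows as deterministic starting centroids."""
    chosen = [0]
    while len(chosen) < k:
        # Distance from every row to its nearest already-chosen row.
        gaps = np.linalg.norm(data[:, None, :] - data[chosen][None, :, :], axis=2)
        chosen.append(int(gaps.min(axis=1).argmax()))
    return data[chosen].copy()

def kmeans_rows(data, k, n_iter=20):
    """Assign each row (gene) to exactly one of k disjoint clusters."""
    centroids = farthest_first(data, k)
    for _ in range(n_iter):
        # Similarity is measured over ALL columns (conditions) at once.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = data[labels == c].mean(axis=0)
    return labels

# Hypothetical data: 6 genes (rows) x 4 conditions (columns).
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],   # rising profile
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],   # falling profile
    [4.1, 2.9, 2.1, 0.9],
    [2.0, 2.0, 2.0, 2.0],   # flat profile
    [2.1, 1.9, 2.0, 2.1],
])
labels = kmeans_rows(expr, k=3)
print(labels)
```

Each gene receives exactly one label, which is what makes the resulting sets disjoint.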
5 Example of a Biological Network: Thiamine (Vitamin B1) Biosynthesis
6 Some Difficulties with Clustering in High-Dimensional Microarray Datasets
1. As the number of attributes (conditions) increases, it becomes increasingly unlikely that objects (genes) will retain similarity over all attributes, and clustering may become difficult.
[Figure: expression level versus conditions, illustrating that standard clustering over all attributes is difficult.]
2. Also, in the gene expression context it is common for related genes to be correlated under some conditions and to act independently under others.
3. Clustering over all conditions may miss relationships that are present only over a local subset of attributes. These may be more significant than a cluster involving all attributes.
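The second and third points can be made concrete with a small Python sketch. The two expression profiles below are invented for illustration: the genes are perfectly correlated over the first four conditions but essentially uncorrelated when all eight conditions are considered.

```python
import numpy as np

# Hypothetical expression profiles of two genes over eight conditions.
gene_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 4.0, 2.0])
gene_b = np.array([2.0, 4.0, 6.0, 8.0, 1.0, 5.0, 2.0, 4.0])

def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    return float(np.corrcoef(x, y)[0, 1])

global_r = pearson(gene_a, gene_b)           # over all conditions
local_r = pearson(gene_a[:4], gene_b[:4])    # over the first four only

print(f"all conditions:        r = {global_r:.2f}")
print(f"first four conditions: r = {local_r:.2f}")
```

A method that compares genes over every condition would see almost no relationship here, even though a strong one exists on a subset of the columns.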
To help tackle these drawbacks, Cheng and Church introduced biclustering into the field of gene expression.
7 Biclustering vs. Clustering
[Figure: clustering over all attributes compared with a bicluster (labels 1, 2, 3, 5, 7, 10 and A, B, C, D, E, F).]
Similarity does not exist over all attributes. Solution: cluster both rows and columns simultaneously (biclustering).
However, searching for biclusters deterministically is NP-hard, with the search space growing exponentially with the number of objects and attributes.
Cheng and Church introduced a Greedy Node
Deletion Algorithm based around a bicluster
scoring function called the Mean Squared Residue
Score.
8 Mean Squared Residue Score
The Mean Squared Residue Score is based on the idea of a residue score for each element of the matrix.
[Figure: element a at row i and column j of a matrix with row set I and column set J.]
The Mean Squared Residue is then calculated for the entire matrix. This gives a measure of how well the rows and columns fit together in the matrix. The lower the score, the more correlated the matrix; a perfectly correlated matrix has a score of 0.
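The score described here can be sketched in Python, assuming the standard Cheng and Church definition: the residue of element a_ij in a submatrix with row set I and column set J is a_ij - a_iJ - a_Ij + a_IJ (where a_iJ is the row mean, a_Ij the column mean, and a_IJ the overall mean of the submatrix), and the Mean Squared Residue is the average of the squared residues. The function name and the example matrices are illustrative assumptions.

```python
import numpy as np

def mean_squared_residue(sub):
    """Mean squared residue of a submatrix (rows I, columns J)."""
    sub = np.asarray(sub, dtype=float)
    row_means = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_means = sub.mean(axis=0, keepdims=True)   # a_Ij
    overall = sub.mean()                          # a_IJ
    residues = sub - row_means - col_means + overall
    return float((residues ** 2).mean())

# Every row is a shifted copy of the first, so the rows move together.
additive = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0],
                     [11.0, 12.0, 13.0]])

# Same matrix with one element disturbed (hypothetical noise).
noisy = additive.copy()
noisy[0, 0] = 9.0

score_additive = mean_squared_residue(additive)
score_noisy = mean_squared_residue(noisy)
print(score_additive, score_noisy)
```

The first matrix scores zero up to floating-point error, while the disturbed matrix scores strictly higher.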
9 Cheng and Church's Greedy Node Deletion Algorithm
Node deletion involves deleting ill-fitting rows and columns from the data matrix until a sub-matrix with the chosen score (δ) is reached.
[Figure: node deletion applied to the input data matrix with threshold δ = 300. The score falls from 1,052 to 543 to 423 to 300, at which point the sub-matrix is a δ-bicluster.]
However, as with other greedy methods, there is a good chance that this search will become stuck at a good local solution and fail to find the best (largest) δ-bicluster.
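A minimal Python sketch of single node deletion follows, assuming the usual rule of removing, at each step, whichever row or column has the highest mean squared residue until the score drops to the threshold. The function names and the test matrix are hypothetical, and the full published method has further phases (multiple node deletion, node addition) that are omitted here.

```python
import numpy as np

def residue_scores(data, rows, cols):
    """Return the overall, per-row, and per-column mean squared residues."""
    sub = data[np.ix_(rows, cols)]
    res = (sub - sub.mean(axis=1, keepdims=True)
               - sub.mean(axis=0, keepdims=True) + sub.mean())
    sq = res ** 2
    return sq.mean(), sq.mean(axis=1), sq.mean(axis=0)

def single_node_deletion(data, delta):
    """Greedily delete the worst-fitting row or column until score <= delta."""
    rows = np.arange(data.shape[0])
    cols = np.arange(data.shape[1])
    while True:
        score, row_scores, col_scores = residue_scores(data, rows, cols)
        if score <= delta or len(rows) == 1 or len(cols) == 1:
            return rows, cols, float(score)
        if row_scores.max() >= col_scores.max():
            rows = np.delete(rows, row_scores.argmax())
        else:
            cols = np.delete(cols, col_scores.argmax())

# Hypothetical data: a perfectly additive 5 x 4 matrix with one noisy row.
data = np.add.outer(np.arange(5.0), 10.0 * np.arange(4.0))
data[2] = [50.0, 0.0, 40.0, 5.0]

rows, cols, score = single_node_deletion(data, delta=0.01)
print(rows, cols, score)
```

Because every step only ever removes the current worst node, the search cannot back out of an early deletion, which is the source of the local-solution problem noted above.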
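10 Our Approach: Using Simulated Annealing to Search for Maximal Biclusters
[Figure: flowchart comparing Greedy Search with Simulated Annealing. Solution X is perturbed to give Solution Y, and Solution Y is scored. If it is better, Solution Y is accepted (re-iterate with Solution Y). If it is worse, Simulated Annealing applies probabilistic acceptance: either Solution Y is accepted, or the search returns to Solution X and re-iterates.]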
11 Simulated Annealing Search
[Figure: fitness score versus temperature (starting at t0) for Greedy Search and Simulated Annealing Search. Simulated Annealing shows large reversals early on and converges on a solution close to the global optimum at low temperature.]
A Simulated Annealing search can bypass locally good solutions that are encountered early on and explore other areas of the search space. The annealing schedule (how fast the temperature decreases) dictates when the system will converge.
We applied Simulated Annealing to the δ-bicluster search within gene expression data.
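The search loop can be sketched in Python as below. This is an illustration under stated assumptions, not the method used in this work: the energy function (a penalty for exceeding δ minus the bicluster area), the single-flip perturbation, the geometric cooling schedule, and all parameter values and names are hypothetical choices.

```python
import math
import random
import numpy as np

def msr(data, rows, cols):
    """Mean squared residue of the submatrix given by rows and cols."""
    sub = data[np.ix_(rows, cols)]
    res = (sub - sub.mean(axis=1, keepdims=True)
               - sub.mean(axis=0, keepdims=True) + sub.mean())
    return float((res ** 2).mean())

def energy(data, mask_r, mask_c, delta, penalty):
    """Assumed objective: lower is better. Rewards area, penalizes score > delta."""
    score = msr(data, np.flatnonzero(mask_r), np.flatnonzero(mask_c))
    return penalty * max(0.0, score - delta) - float(mask_r.sum() * mask_c.sum())

def anneal_bicluster(data, delta, t0=10.0, cooling=0.995, steps=3000,
                     penalty=10.0, seed=0):
    rng = random.Random(seed)
    n, m = data.shape
    cur_r, cur_c = np.ones(n, dtype=bool), np.ones(m, dtype=bool)
    cur_e = energy(data, cur_r, cur_c, delta, penalty)
    best = (cur_r.copy(), cur_c.copy(), cur_e)
    for step in range(steps):
        temp = t0 * cooling ** step          # geometric annealing schedule
        new_r, new_c = cur_r.copy(), cur_c.copy()
        k = rng.randrange(n + m)             # perturb: flip one row or column
        if k < n:
            new_r[k] = not new_r[k]
        else:
            new_c[k - n] = not new_c[k - n]
        if new_r.sum() < 2 or new_c.sum() < 2:
            continue                         # keep at least a 2 x 2 submatrix
        new_e = energy(data, new_r, new_c, delta, penalty)
        # Better solutions are always accepted; worse ones with a probability
        # that shrinks as the temperature falls.
        if new_e <= cur_e or rng.random() < math.exp((cur_e - new_e) / temp):
            cur_r, cur_c, cur_e = new_r, new_c, new_e
            if cur_e < best[2]:
                best = (cur_r.copy(), cur_c.copy(), cur_e)
    return best

# Hypothetical data: additive 6 x 5 matrix with two noisy rows.
data = np.add.outer(np.arange(6.0), 10.0 * np.arange(5.0))
data[1] = [40.0, 3.0, 55.0, 0.0, 21.0]
data[4] = [7.0, 60.0, 2.0, 35.0, 90.0]

start_e = energy(data, np.ones(6, dtype=bool), np.ones(5, dtype=bool), 0.01, 10.0)
best_rows, best_cols, best_e = anneal_bicluster(data, delta=0.01)
print(np.flatnonzero(best_rows), np.flatnonzero(best_cols), best_e)
```

Unlike node deletion, the perturbation here can also add a row or column back, so an early poor choice is not permanent.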
12 Biclustering Using Simulated Annealing: Yeast Data Matrix
[Figure: sizes of three different δ-biclusters compared for the original Cheng and Church Node Deletion Algorithm, the adjusted CC Algorithm (which produces a bicluster with the same column size as Simulated Annealing), and the Simulated Annealing Algorithm.]
Simulated Annealing produces larger δ-biclusters than Node Deletion in the yeast data set.
Similar results are achieved for the second bicluster search.
13 Biclustering Using Simulated Annealing
[Table: results for the first and second biclusters on the yeast data and the human data.]
Simulated Annealing achieves better results (discovers larger δ-biclusters) than the original Cheng and Church algorithm over all data sets.
Simulated Annealing outperforms the adjusted Cheng and Church algorithm in 4 of 9 cases and ties in a further 3 of 9.
14 Applying SA to an Annotated Data Set
Of the 2884 genes in the yeast gene expression matrix, 550 can be annotated from the online database Kyoto Encyclopedia of Genes and Genomes (KEGG).
These genes were biclustered using Simulated Annealing to determine whether functionally related groups could be found.
Results
Bicluster 1
The first bicluster contains a high proportion of
Ribosomal Genes.
The second bicluster contains a significant
number of genes involved in regulation of gene
expression.
However, no dominating groups exist in subsequently discovered biclusters. This may be a result of the top-down search method, which finds artificially large biclusters that are not reflected in biology.
15 Conclusions and Future Work
Biclustering using the Simulated Annealing (SA) search technique shows improvements on the Node Deletion technique proposed by Cheng and Church.
SA finds larger δ-biclusters and biologically verifiable biclusters (known related genes).
However, no dominating groups exist in subsequently discovered biclusters. Part of the reason for this may be the top-down search method, which finds artificially large biclusters that are not reflected in biology.
A future approach could involve a bottom-up search, which would find a more natural clustering of genes. Such an approach could be taken using SA or other bottom-up searches such as Beam Search.
The work discussed here is due to be published in the Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS).