Title: Biology-Driven Clustering of Microarray Data:
1 Biology-Driven Clustering of Microarray Data
- Applications to the NCI60 Data Set
K.R. Coombes, K.A. Baggerly, D.N. Stivers, J.
Wang, D. Gold, H.G. Sung, and S.J. Lee
2 Introduction
- Most analyses of microarray data proceed as
though it were simply a large, unstructured
matrix. Such analyses ignore substantial amounts
of existing biological information. In the study
of cancer, we already know many important genes
through their involvement in specific biological
processes, and we know that reproducible
chromosomal abnormalities play an important role.
We see a need for developing analytic strategies
that exploit this biological information.
Methods
- We analyzed the NCI60 data set by first
determining the chromosomal location and
biological function of the genes on the
microarray. We performed separate analyses using
genes on individual chromosomes and genes
involved in different biological processes. The
fundamental advantage of this approach is that it
provides results that are immediately and
directly interpretable without resorting to ex
post facto rationalizations.
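The separate analyses amount to splitting the rows of the expression matrix by an annotation (chromosome or biological process) before clustering the samples. The following is a minimal sketch of that step, not the procedure actually used here. The data layout, the function names, and the toy values are all hypothetical, and the distance (one minus Pearson correlation between samples) is an illustrative choice.

```python
from collections import defaultdict
from math import sqrt


def split_by_annotation(expression, annotation):
    """Group gene expression rows by an annotation label (e.g. chromosome).

    expression: dict gene -> list of expression values, one per sample
    annotation: dict gene -> group label; genes without a label are skipped
    """
    groups = defaultdict(dict)
    for gene, values in expression.items():
        label = annotation.get(gene)
        if label is not None:
            groups[label][gene] = values
    return dict(groups)


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def sample_distances(rows):
    """Pairwise (1 - correlation) distances between samples (columns)."""
    columns = list(zip(*rows.values()))  # one tuple of gene values per sample
    n = len(columns)
    return [[1.0 - pearson(columns[i], columns[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy data: 5 genes measured on 3 samples.
expression = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 4.0, 6.5],
    "g3": [3.0, 1.0, 2.0],
    "g4": [0.5, 0.1, 0.9],
    "g5": [1.5, 1.1, 0.2],
}
annotation = {"g1": "chr2", "g2": "chr2", "g3": "chr2", "g4": "chr16"}  # g5 unannotated

groups = split_by_annotation(expression, annotation)
chr2_distances = sample_distances(groups["chr2"])
```

Each per-group distance matrix could then be passed to any hierarchical clustering routine to produce one dendrogram per chromosome or per functional category.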
3 How many genes on the microarray have good
annotations?
- Problem
- I.M.A.G.E. clone IDs and GenBank accession
numbers are archival.
- UniGene clusters, gene names, descriptions, etc.,
are changeable.
- Solution
- Download the latest version of UniGene (build
137) and LocusLink (July 2001) to update
annotations, using the GenBank accession numbers
describing both 3' and 5' ends of the genes
spotted on the microarrays.
Table 1. There are only 7478 spots (out of
10,000) on the array with valid, matching UniGene
cluster IDs. Genes with unknown or conflicting
annotations were eliminated before performing any
further analysis.
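The filtering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual annotation pipeline: the function name and all identifiers are hypothetical, and "valid, matching" is interpreted here as "the available 3' and 5' accessions resolve to exactly one UniGene cluster".

```python
def valid_spots(spots, accession_to_cluster):
    """Return {spot_id: cluster_id} for spots with a usable annotation.

    spots: dict spot_id -> (accession_3prime, accession_5prime);
           either accession may be None
    accession_to_cluster: dict GenBank accession -> UniGene cluster ID
    """
    kept = {}
    for spot_id, accessions in spots.items():
        clusters = {accession_to_cluster[a] for a in accessions
                    if a is not None and a in accession_to_cluster}
        # Assumption: keep a spot only when its ends agree on one cluster.
        # Zero clusters means unknown; two clusters means conflicting.
        if len(clusters) == 1:
            kept[spot_id] = clusters.pop()
    return kept


# Hypothetical identifiers, for illustration only.
accession_to_cluster = {"ACC1": "Hs.1", "ACC2": "Hs.1", "ACC3": "Hs.2", "ACC4": "Hs.3"}
spots = {
    "spot1": ("ACC1", "ACC2"),   # both ends agree
    "spot2": ("ACC1", "ACC3"),   # conflicting clusters
    "spot3": ("ACC9", None),     # accession unknown to the current build
    "spot4": ("ACC4", None),     # only one end available
}
kept = valid_spots(spots, accession_to_cluster)
```

Whether a spot with only one mapped end should be kept is a design choice; the sketch keeps it.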
4 Where are the genes located?
We compared the number of genes on the microarray
that mapped to each chromosome with the number
known to be on the chromosome, based on current
figures from the NCBI. A chi-squared test was
used to test whether the genes on the array were
distributed uniformly across chromosomes relative
to those known counts.
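This comparison can be sketched as a goodness-of-fit calculation in which the expected count for each chromosome is the array total scaled by that chromosome's share of the reference counts. The function name and the toy counts below are hypothetical, not values from this study.

```python
def chi_squared_gof(observed, reference):
    """Chi-squared goodness-of-fit statistic against reference proportions.

    observed:  dict category -> count on the array
    reference: dict category -> count known for that category
    Returns (statistic, degrees_of_freedom, standardized_residuals).
    """
    total_obs = sum(observed.values())
    total_ref = sum(reference.values())
    statistic = 0.0
    residuals = {}
    for category, ref_count in reference.items():
        expected = total_obs * ref_count / total_ref
        obs = observed.get(category, 0)
        statistic += (obs - expected) ** 2 / expected
        # Positive residual: overrepresented; negative: underrepresented.
        residuals[category] = (obs - expected) / expected ** 0.5
    return statistic, len(reference) - 1, residuals


# Hypothetical toy counts for three categories.
reference = {"A": 100, "B": 100, "C": 200}
observed = {"A": 30, "B": 20, "C": 50}
statistic, dof, residuals = chi_squared_gof(observed, reference)
```

The statistic is then compared against a chi-squared distribution with the returned degrees of freedom; the signed residuals indicate which categories are over- or underrepresented.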
Figure 1. Distribution of the genes on the array
by chromosome. Chromosomes 19 and Y are
substantially underrepresented when compared to
the numbers known to LocusLink; chromosomes 6 and
13 are overrepresented.
5 How do we determine gene functions?
- Using our updated UniGene clusters, we followed
the links from UniGene to LocusLink to
GeneOntology.
- GeneOntology is a structured, hierarchical
vocabulary to describe gene functions in three
broad areas:
- biological process (why)
- molecular function (what)
- cellular component (where)
- The 7478 good spots on the array corresponded to
6614 distinct genes, of which 5074 were known to
LocusLink, and 2989 had at least one annotation
in GeneOntology.
- We focused on the biological process annotations
in the GeneOntology vocabulary, since these had
the most natural interpretation for application
to the study of cancer. We counted the number of
genes having annotations of functions at or below
each level in the hierarchy, and selected a set
of categories that each contained roughly one to
a few hundred genes, with the categories as a
whole accounting for more than 95% of all
annotations (Table 2).
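Counting genes "at or below" a term means taking the union of the genes annotated to that term and to all of its descendants. A minimal sketch follows, assuming the hierarchy is available as a parent-to-children mapping; the function name, term names, and gene names are hypothetical and do not reflect the actual GeneOntology structure.

```python
def genes_at_or_below(children, direct_genes):
    """Map each term to the set of genes annotated to it or any descendant.

    children:     dict term -> list of child terms (acyclic; a term may
                  have more than one parent)
    direct_genes: dict term -> set of genes annotated directly to the term
    """
    cache = {}

    def collect(term):
        if term not in cache:
            genes = set(direct_genes.get(term, ()))
            for child in children.get(term, ()):
                genes |= collect(child)
            cache[term] = genes
        return cache[term]

    terms = set(children) | set(direct_genes)
    for kids in children.values():
        terms.update(kids)
    return {term: collect(term) for term in terms}


# Hypothetical toy hierarchy; "shared" has two parents.
children = {
    "root": ["termA", "termB"],
    "termA": ["termA1", "shared"],
    "termB": ["shared"],
}
direct_genes = {
    "termA": {"geneB", "geneC"},
    "termA1": {"geneA", "geneB"},
    "termB": {"geneD"},
    "shared": {"geneE"},
}
counts = {term: len(genes)
          for term, genes in genes_at_or_below(children, direct_genes).items()}
```

Using sets rather than summed counts ensures that a gene reachable through several paths, or annotated at several levels, is counted once per term.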
6 What functional categories are represented on the
array?
Table 2. The number of annotations (Ann.) in, and
the number of spots on the array belonging to,
various functional categories chosen from the
biological process annotations from LocusLink
into GeneOntology. Individual spots may have
multiple annotations into the same category;
individual genes may be represented by multiple
spots.
7 How good is a dendrogram?
We introduced a quality grade, based on the
dendrograms, to describe how well each set of
genes used to produce a dendrogram classifies
each kind of cancer:
- A: there is a cluster containing all and only
one kind of cancer
- B: all, with one or two extras
- C: all except one
- D: all except one, with extras
- E: all except two
- F: all except two, with extras
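The grading scheme can be expressed as a small rule table. The sketch below is one possible reading of the rules, not the procedure actually used: the function names and sample labels are hypothetical, the limit of "one or two extras" stated for grade B is assumed to apply to grades D and F as well, and the dendrogram is assumed to be supplied as a list of its clusters (subtrees).

```python
GRADES = {
    (0, False): "A", (0, True): "B",
    (1, False): "C", (1, True): "D",
    (2, False): "E", (2, True): "F",
}


def grade_cluster(cluster, labels, cancer_type, max_extras=2):
    """Grade one cluster for one cancer type, or return None if ungraded.

    cluster: iterable of sample IDs in the cluster
    labels:  dict sample ID -> cancer type
    """
    cluster = set(cluster)
    members = {s for s, t in labels.items() if t == cancer_type}
    missing = len(members - cluster)   # samples of this type left out
    extras = len(cluster - members)    # samples of other types included
    # Assumption: at most max_extras extras are tolerated for any grade.
    if missing > 2 or extras > max_extras or not (members & cluster):
        return None
    return GRADES[(missing, extras > 0)]


def grade_dendrogram(clusters, labels, cancer_type):
    """Best grade for a cancer type over all clusters of a dendrogram."""
    grades = [g for g in (grade_cluster(c, labels, cancer_type)
                          for c in clusters) if g is not None]
    return min(grades) if grades else None  # "A" sorts before "F"


# Hypothetical toy example: six samples and a few subtrees.
labels = {"s1": "colon", "s2": "colon", "s3": "colon",
          "s4": "leukemia", "s5": "leukemia", "s6": "breast"}
clusters = [{"s1", "s2", "s3"}, {"s4", "s6"}, {"s4", "s5", "s6"},
            {"s1", "s2", "s3", "s4", "s5", "s6"}]
colon_grade = grade_dendrogram(clusters, labels, "colon")
leukemia_grade = grade_dendrogram(clusters, labels, "leukemia")
```

Cancer types with very few samples would need extra care under this reading, since a cluster missing two samples of a three-sample type still receives a grade.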
Grades for the dendrogram of Figure 2 are
displayed in the following table.
Figure 2. Dendrogram using all genes with valid
annotations and with expression levels
above those of the blank spots.
8 Heterogeneity of different types of cancer
- Some cancers (colon, leukemia) are fairly
homogeneous and easy to distinguish from others.
- Some (breast, lung) are so heterogeneous as to be
nearly impossible to distinguish.
- Some chromosomes (1, 2, 6, 7, 9, 12, 17) can
distinguish many types of cancer.
- Some (16, 21) cannot accurately distinguish any
kind of cancer. The dendrograms using genes from
these chromosomes are equivalent to a random
scrambling of the cancer cell lines.
Table 3. Grades given to dendrograms that cluster
samples by genes on specific chromosomes. Grades
range from A to F, with blanks indicating no
clustering for that type of sample.
Abbreviations: B = breast, C = colon,
L = leukemia, M = melanoma, N = non-small cell
lung, O = ovarian, P = prostate, R = renal,
S = central nervous system.
9 Chromosome 2
Figure 3. The genes on chromosome 2 do
an excellent job of distinguishing cancer types.
We can also locate specific clusters of genes on
the chromosome with strong signatures
identifying leukemia, melanoma, and colon cancer.
10 Chromosome 16
Figure 4. Genes on chromosome 16 cannot
reliably distinguish any single kind of cancer in
this study. There are, nevertheless, strong gene
signatures driving the clustering, which does not
appear to match anything we know about the
biology of the samples.
11 Protein Metabolism
Figure 5. The genes involved in protein
metabolism do an excellent job of distinguishing
cancer types. We can also locate specific
clusters of genes with strong signatures
identifying leukemia, colon cancer, lung cancer,
and central nervous system cancer.
12 Apoptosis
Figure 6. The genes involved in apoptosis do a
poor job of distinguishing cancer types. This
suggests that the mechanisms by which cancers
overcome cell death cut across the normal
biological lines drawn by histology.
13 Conclusions
- Functional categories that are good at
distinguishing cancers include signal
transduction, cell cycle, cell proliferation, and
protein metabolism. Some differences result from
the histology of the underlying tissue. Others
reflect differences in the way particular kinds
of cancers overcome limits on cell growth.
- Categories that are poor at distinguishing
cancers include energy pathways and apoptosis.
The latter observation has potential implications
for cancer therapies designed to trigger
apoptosis, since it suggests that the mechanisms
by which cancer cells avoid cell death are not
linked to the general type of cancer but are
either common across cancers or idiosyncratic.
- Multiple views into the data provide substantial
insight into differences in cancer types and gene
sets.
- Cancer types differ greatly in their degree of
heterogeneity, ranging from homogeneous (colon,
leukemia) through moderately heterogeneous
(renal, melanoma) to extremely heterogeneous
(breast and lung).
- Homogeneous cancers exhibit strong identifying
signals across most views of the data, regardless
of function or chromosome.
- There are large differences in the ability of
genes on different chromosomes to distinguish
cancer types. There are similar differences for
genes involved in different biological processes
(data not shown).