Title: SVM
1Statistical Classification for Gene Analysis
based on Micro-array Data
- Fan Li, Yiming Yang
- hustlf_at_cs.cmu.edu
- In collaboration with Judith Klein-Seetharaman
2Principles of cDNA microarray
[Figure: cDNA microarray workflow. Labels: DNA clones, PCR purification, robot printing; treated sample and reference, reverse transcription, label with fluorescent dyes; hybridize target to microarray; laser 1 and laser 2 excitation, emission; computer analysis. Source: G. Gibson et al.]
3Microarray data: what does it look like?
- Expression matrix
- Expression level of a gene across treatments
- Expression profiles of genes in a certain condition
- Typical examples: heat shock, G phase in cell cycle, etc. (conditions); liver cancer patient, normal person, etc. (samples)
4AML/ALL micro-array dataset
- This dataset can be downloaded from http://genome-www.standford.edu/clustering
- Matrix
  - Each row: a gene
  - Each column: a patient (a sample)
- Each patient belongs to one of two disease types: AML (acute myeloid leukemia) or ALL (acute lymphoblastic leukemia)
- The 72 patient samples are further divided into a training set (27 ALLs and 11 AMLs) and a test set (20 ALLs and 14 AMLs). The whole dataset covers 7129 probes from 6817 human genes.
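
The layout described on this slide can be sketched as follows. Only the dimensions and class counts come from the slide; the expression values are random placeholders, not the real AML/ALL measurements, and the helper names (`make_labels`, `gene_profile`, `sample_profile`) are hypothetical.

```python
import random

N_PROBES = 7129                         # rows: one per probe (from the slide)
TRAIN_COUNTS = {"ALL": 27, "AML": 11}   # training set (from the slide)
TEST_COUNTS = {"ALL": 20, "AML": 14}    # test set (from the slide)

def make_labels(counts):
    """Return one class label per patient, e.g. ['ALL', ..., 'AML', ...]."""
    return [cls for cls, n in counts.items() for _ in range(n)]

train_labels = make_labels(TRAIN_COUNTS)
test_labels = make_labels(TEST_COUNTS)
n_samples = len(train_labels) + len(test_labels)

# Placeholder matrix (random values): matrix[i][j] is probe i in patient j.
rng = random.Random(0)
matrix = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
          for _ in range(N_PROBES)]

def gene_profile(matrix, i):
    """Row i: expression of one probe across all patients."""
    return matrix[i]

def sample_profile(matrix, j):
    """Column j: expression of all probes in one patient."""
    return [row[j] for row in matrix]
```

With this orientation, a classifier sees each patient as a vector of 7129 features, which is why feature selection matters for only 38 training samples.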
5Published work on AML/ALL
- Classification task: gene expression -> {AML, ALL}
- Techniques: Support Vector Machines (SVM), Rocchio-style and logistic regression classifiers
- Main findings: classifiers perform better when using a small subset (8) of genes instead of thousands
- Implication: many genes are irrelevant or redundant?
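
As an illustration of the simplest of the techniques listed above, here is a minimal sketch of a Rocchio-style classifier in its plainest form: one centroid per class, no negative-example term, cosine similarity for assignment. This is an assumed variant for illustration, not the configuration used in the published work, and the toy expression values are invented.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fit_centroids(samples, labels):
    """Build one prototype (centroid) per class."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_class.items()}

def predict(centroids, x):
    """Assign x to the class whose prototype is most similar."""
    return max(centroids, key=lambda y: cosine(centroids[y], x))

# Toy data: two selected genes per patient (values invented for illustration).
train_x = [[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2]]
train_y = ["ALL", "ALL", "AML", "AML"]
model = fit_centroids(train_x, train_y)
```

Calling `predict(model, [1.5, 0.2])` assigns the new patient to the class with the closer prototype. With only a handful of genes, each prototype is a very short vector, which is one way to see why a small, well-chosen subset can be enough.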
6Possible Relationship (Hypothesis)
7How can we find such a structure?
- Find the most informative genes (primary ones)
  - Statistical feature selection (brief)
- Find the genes related (or similar) to the primary ones
  - Unsupervised clustering (detailed)
  - Based on statistical patterns of genes distributed over microarrays
- Bayes network for causal reasoning (future direction)
8Possible Relationship (Hypothesis)
disease
9Feature selection
- Feature selection
  - Choose a small subset of input variables (a few instead of 7000 genes, for example)
- In text categorization
  - Features: words in documents
  - Output variables: subject categories of a document
- In protein classification
  - Features: amino acid motifs
  - Output variables: protein categories
- In genome micro-array data
  - Features: useful genes
  - Output variables: whether a patient is diseased or not
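
Choosing a small subset of genes can be sketched as a filter: score every gene independently against the class labels and keep the k best. The score below (absolute difference of class means divided by the sum of class standard deviations) is one common signal-to-noise choice, assumed here for illustration; the function names and toy values are hypothetical.

```python
from statistics import mean, pstdev

def signal_to_noise(values_a, values_b):
    """|difference of class means| / (sum of class standard deviations).

    Returns 0.0 when both classes have zero spread, to avoid dividing by zero.
    """
    spread = pstdev(values_a) + pstdev(values_b)
    if spread == 0:
        return 0.0
    return abs(mean(values_a) - mean(values_b)) / spread

def select_top_k(matrix, labels, k, positive="AML"):
    """Score every gene (row) independently; return indices of the k best."""
    scores = []
    for i, row in enumerate(matrix):
        a = [v for v, y in zip(row, labels) if y == positive]
        b = [v for v, y in zip(row, labels) if y != positive]
        scores.append((signal_to_noise(a, b), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy matrix: rows are genes, columns are patients (values invented).
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
matrix = [
    [1.0, 1.2, 0.9, 1.1, 1.0, 0.8],   # gene 0: same level in both classes
    [0.2, 0.1, 0.3, 2.0, 2.2, 1.9],   # gene 1: clearly higher in AML
    [0.5, 1.5, 0.4, 1.4, 0.6, 1.6],   # gene 2: noisy, weak separation
]
selected = select_top_k(matrix, labels, k=1)
```

A filter like this scores each gene in isolation, so it cannot tell whether two high-scoring genes carry the same information.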
10Feature selection on micro-array (AML vs ALL)
- Golub-Slonim: GS-ranking (filtering method)
- Ben-Dor: TNoM-ranking (filtering method)
- Isabelle Guyon: recursive SVM (wrapper method)
  - Selected 8 genes (out of 1000 in that dataset)
  - Accuracy: 100%
- Our work (Fan and Yiming) (best)
  - Selected 3 genes (using ridge regression)
  - Accuracy: 100%
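
The slide does not say how ridge regression was used to pick genes. One common approach, assumed here, is to fit ridge weights on centered, comparably scaled features and rank genes by absolute weight. The sketch below solves the closed-form system (X^T X + lam*I) w = X^T y directly, which is only practical for a small number of candidate genes; the function names and toy data are hypothetical.

```python
def ridge_weights(X, y, lam=1.0):
    """Solve (X^T X + lam*I) w = X^T y by Gauss-Jordan elimination.

    X: list of samples, each a list of p feature values (assumed centered).
    y: list of targets, e.g. +1 for AML and -1 for ALL.
    """
    p = len(X[0])
    # Augmented system [A | b] with A = X^T X + lam*I and b = X^T y.
    aug = []
    for i in range(p):
        row = [sum(s[i] * s[j] for s in X) + (lam if i == j else 0.0)
               for j in range(p)]
        row.append(sum(s[i] * t for s, t in zip(X, y)))
        aug.append(row)
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][p] for i in range(p)]

def rank_by_weight(weights):
    """Feature indices ordered from largest to smallest |weight|."""
    return sorted(range(len(weights)), key=lambda i: abs(weights[i]),
                  reverse=True)

# Toy data: 4 patients, 2 centered genes (values invented for illustration).
X = [[-1.0, 0.2], [-0.9, -0.3], [1.0, 0.1], [0.9, 0.0]]
y = [-1.0, -1.0, 1.0, 1.0]
weights = ridge_weights(X, y, lam=1.0)
ranking = rank_by_weight(weights)
```

Because the penalty shrinks all weights jointly, two strongly correlated genes tend to share weight rather than both ranking high, which is one reason a regression-based ranking can differ from a per-gene filter.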
11Feature selection experiments already done on
this micro-array data
- The 3 genes we found
- Id 1882: CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage), M27891_at
- Id 6201: INTERLEUKIN-8 PRECURSOR, Y00787_at
- Id 4211: VIL2 Villin 2 (ezrin), X51521_at
12Some analysis of the results we got
- The first two genes are strongly correlated with each other.
- The third gene is very different from the first two genes.
- Using the 1st and 2nd genes together is bad (10/34 errors)
- Using the 1st and 3rd genes together is good (1/34 error)
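
"Strongly correlated" can be made concrete with the Pearson correlation between two expression profiles. A minimal sketch follows; the three profiles are invented to mimic the situation on this slide and are not the measured values of the three selected genes.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

# Invented profiles over 6 patients.
gene1 = [0.1, 0.4, 0.5, 1.8, 2.1, 2.0]
gene2 = [0.2, 0.5, 0.4, 1.9, 2.0, 2.2]   # tracks gene1 closely
gene3 = [1.0, 0.2, 1.9, 0.3, 1.1, 0.6]   # unrelated pattern

r12 = pearson(gene1, gene2)
r13 = pearson(gene1, gene3)
```

When two genes are almost perfectly correlated, the second adds little information beyond the first, which is consistent with the poor result for the first pair and the good result when a dissimilar gene is added.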
13Question: As the next step, can we find more
gene-gene relationships?
- Several techniques available
- Clustering
- Bayesian network learning
- Independent component analysis
14Clustering Analysis in micro-array data
- Clustering methods have already been widely used to find similar genes or common binding sites from micro-array data.
- A lot of different clustering algorithms
  - Hierarchical clustering
  - K-means
  - SOM
  - CAST
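
Of the algorithms listed above, k-means is the shortest to write down. The sketch below is plain Lloyd's algorithm with squared Euclidean distance and a deterministic start (the first k profiles as initial centers). Both choices are assumptions made for illustration; expression studies often use correlation-based distances and random restarts instead. The toy profiles are invented.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (cluster index per point, final centers)."""
    centers = [list(p) for p in points[:k]]   # deterministic initialization
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                              # converged: nothing moved
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # keep old center if cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Each row is one gene's expression profile over 4 samples (values invented).
profiles = [
    [0.1, 0.2, 1.9, 2.0],
    [2.0, 1.8, 0.2, 0.1],
    [0.2, 0.1, 2.1, 1.8],
    [1.9, 2.1, 0.1, 0.3],
]
clusters, centers = kmeans(profiles, k=2)
```

Genes 0 and 2 end up in one cluster and genes 1 and 3 in the other, because their profiles rise and fall together across the samples.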
15An example of hierarchical clustering
analysis (from Spellman et al.)
16Our clustering experiment on AML/ALL dataset
- Our clustering is run over the top 1000 genes
most relevant to the disease.
17The feature-selection curve
18Our clustering result in the top 1000 genes
19Some analysis of the clustering result
- The first two genes are always placed in the same cluster (cluster 1 in hierarchical clustering, cluster 2 in k-means clustering).
- The third gene is never placed in the same cluster as the first two genes (cluster 23 in hierarchical clustering, cluster 1 in k-means clustering).
- This validates our previous analysis.
20Disadvantages of Clustering
- However, clustering has limitations:
  - It cannot find the internal relationships inside one cluster.
  - It cannot find the relationships between clusters.
  - Genes connected to each other may not be in the same cluster.
- Clustering vs Bayesian network learning (copied from David K. Gifford, Science, vol. 293, Sept. 2001)
21A counterexample of clustering analysis
22Bayesian network learning
- Thus a Bayesian network seems a much better technique if we want to model the relationships among genes.
- Researchers have done experiments and constructed Bayesian networks from micro-array data.
- They found that a few genes have a lot of connections with other genes.
- They used prior biological knowledge to validate their learned edges (interactions between genes) and found them reasonable.
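
Once a network has been learned, the "few genes with a lot of connections" can be found by counting edges per node. The sketch below assumes the learned network is available as a list of (parent, child) pairs; the gene names and edges are placeholders, not results from any published network.

```python
from collections import Counter

def degree_counts(edges):
    """Number of edges touching each gene, ignoring edge direction."""
    degree = Counter()
    for parent, child in edges:
        degree[parent] += 1
        degree[child] += 1
    return degree

def hubs(edges, min_degree):
    """Genes with at least min_degree connections, sorted by name."""
    degree = degree_counts(edges)
    return sorted(g for g, d in degree.items() if d >= min_degree)

# Placeholder network: g1 is connected to four other genes.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g5", "g1"), ("g2", "g3")]
hub_genes = hubs(edges, min_degree=3)
```

The threshold `min_degree` is arbitrary here; in practice it would be chosen relative to the degree distribution of the learned graph.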
23An example of a Bayesian network
- Part of the Bayesian network constructed by Nir Friedman. There are 800 genes (nodes) in total in the graph. These 800 genes are all cell-cycle regulated genes.
25Our plan in genetic regulatory network
construction
- There are several possible ways:
- Use feature selection techniques to make the network learning task more robust and less computationally costly.
- Learn gene regulatory networks on microarray datasets with disease labels (thus we may find pathways relevant to a specific disease).
- Use ICA to find hidden variables (hidden layers) and check their consistency with the Bayesian network learning result.
26Our plan in genetic regulatory network
construction
- Use prior biological knowledge about gene networks, such as network motifs. The following example is copied from Shai S. Shen-Orr, Nature Genetics, 2002. Previous network learning algorithms have not considered these characteristics.
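
One motif from the network-motif literature is the feed-forward loop, in which X regulates Y and both X and Y regulate Z. The sketch below counts such loops in a directed edge list. Picking this particular motif is an assumption, since the slide's figure did not survive in the transcript; the node names are placeholders.

```python
def feed_forward_loops(edges):
    """Return all (x, y, z) with edges x->y, y->z and x->z, x, y, z distinct."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)
    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue
            for z in targets.get(y, ()):
                # z must be a third node that x also regulates directly.
                if z != x and z != y and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)

# Placeholder regulatory network with a single feed-forward loop.
edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
loops = feed_forward_loops(edges)
```

A count like this could serve as a prior or a sanity check during structure learning, favoring candidate networks whose motif statistics resemble known regulatory networks.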
28References
- Using Bayesian networks to analyze expression data. Friedman, N., Linial, M., Nachman, I. Journal of Computational Biology, 7:601-620, 2000.
- Gene selection for cancer classification using support vector machines. Guyon, I. et al. Machine Learning, 46:389-422.
- Cluster analysis and display of genome-wide expression patterns. Eisen, M.B. et al. PNAS, 95:14863-14868, 1998.
- Clustering gene expression patterns. Ben-Dor, A., Shamir, R., and Yakhini, Z. Journal of Computational Biology, 6(3/4):281-297, 1999.