SVM - PowerPoint PPT Presentation

Provided by: jia130
Learn more at: http://www.cs.cmu.edu

1
Statistical Classification for Gene Analysis
based on Micro-array Data
  • Fan Li Yiming Yang
  • hustlf_at_cs.cmu.edu
  • In collaboration with Judith Klein-Seetharaman

2
Principles of cDNA microarray
[Figure, after G. Gibson et al.: cDNA microarray workflow. Labels: DNA clones; PCR purification; robot printing; reference and treated samples; reverse transcription; label with fluorescent dyes; hybridize target to microarray; laser 1 and laser 2 excitation; emission; computer analysis.]
3
Microarray data: what does it look like?
  • Expression matrix
  • Expression level of a gene across treatments
  • Expression profiles of genes in a certain condition
  • Typical examples: heat shock, G phase in cell cycle, etc.
    (conditions); liver cancer patient, normal person, etc.
    (samples)
4
AML/ALL micro-array dataset
  • This dataset can be downloaded from
    http://genome-www.stanford.edu/clustering
  • Matrix:
      • Each row: a gene
      • Each column: a patient (a sample)
  • Each patient belongs to one of two disease types:
    AML (acute myeloid leukemia) or ALL (acute
    lymphoblastic leukemia)
  • The 72 patient samples are further divided into a
    training set (27 ALLs and 11 AMLs) and a test set
    (20 ALLs and 14 AMLs). The whole dataset covers
    7129 probes from 6817 human genes.
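
The layout described above can be sketched as a genes-by-samples matrix with one label per column. This is only an illustration: the values are random, the number of genes is shrunk to 5, and the helper names (random_matrix, gene_profile, sample_profile) are made up for this sketch. Only the class counts come from the slide.

```python
import random

# Hypothetical toy size; the real matrix has 7129 probe rows.
N_GENES = 5

# Class counts taken from the slide: 38 training and 34 test samples.
train_labels = ["ALL"] * 27 + ["AML"] * 11
test_labels = ["ALL"] * 20 + ["AML"] * 14


def random_matrix(n_rows, n_cols, seed):
    """Build a placeholder expression matrix (rows = genes, columns = samples)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


train_matrix = random_matrix(N_GENES, len(train_labels), seed=0)
test_matrix = random_matrix(N_GENES, len(test_labels), seed=1)


def gene_profile(matrix, gene_index):
    """Expression of one gene across all samples (one row)."""
    return matrix[gene_index]


def sample_profile(matrix, sample_index):
    """Expression of all genes in one sample (one column)."""
    return [row[sample_index] for row in matrix]
```

Reading a row gives a gene's behaviour across patients; reading a column gives one patient's expression profile.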

5
Published work on AML/ALL
  • Classification task: gene expression -> AML,
    ALL
  • Techniques: Support Vector Machines (SVM),
    Rocchio-style and logistic regression classifiers
  • Main findings: classifiers can get better
    performance when using a small subset (8) of
    genes, instead of thousands
  • Implication: many genes are irrelevant or
    redundant?

6
Possible Relationship (Hypothesis)
7
How can we find such a structure?
  • Find the most informative genes (primary ones)
      • Statistical feature selection (brief)
  • Find the genes related (or similar) to the
    primary ones
      • Unsupervised clustering (detailed)
      • based on statistical patterns of genes
        distributed over microarrays
  • Bayes network for causal reasoning (future
    direction)

8
Possible Relationship (Hypothesis)
disease
9
Feature selection
  • Choose a small subset of input variables (a few
    instead of 7000 genes, for example)
  • In text categorization
      • Features: words in documents
      • Output variables: subject categories of a
        document
  • In protein classification
      • Features: amino acid motifs
      • Output variables: protein categories
  • In genome micro-array data
      • Features: useful genes
      • Output variables: whether a patient is diseased
        or not
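
One simple way to choose a small subset of genes is a filter: score every gene on its own and keep the top scorers. The sketch below assumes a signal-to-noise score (difference of class means divided by the sum of class standard deviations); the function names and the toy matrix are invented for illustration, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def signal_to_noise(values, labels, positive="ALL"):
    """Score one gene: (mean_pos - mean_neg) / (std_pos + std_neg)."""
    pos = [v for v, y in zip(values, labels) if y == positive]
    neg = [v for v, y in zip(values, labels) if y != positive]
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def rank_genes(matrix, labels, top_k):
    """Return the indices of the top_k genes by absolute score."""
    scores = [(abs(signal_to_noise(row, labels)), i) for i, row in enumerate(matrix)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]


# Toy data (invented): rows are genes, columns are samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
matrix = [
    [1.0, 1.2, 0.9, 1.1, 1.0, 0.8],  # gene 0: similar in both classes
    [5.0, 5.5, 4.5, 1.0, 1.5, 0.5],  # gene 1: high in ALL, low in AML
    [2.0, 3.0, 1.0, 2.5, 1.5, 2.0],  # gene 2: noisy, no separation
]
top = rank_genes(matrix, labels, top_k=1)
```

Because each gene is scored independently, a filter like this cannot tell whether two high-scoring genes are redundant with each other.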

10
Feature selection on micro-array (AML vs ALL)
  • Golub-Slonim: GS-ranking (filtering method)
  • Ben-Dor: TNoM-ranking (filtering method)
  • Isabelle Guyon: recursive SVM (wrapper method)
      • Selected 8 genes (out of 1000 in that dataset)
      • Accuracy: 100%
  • Our work (Fan & Yiming) (best)
      • Selected 3 genes (using ridge regression)
      • Accuracy: 100%
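
The slides do not say how ridge regression was turned into a gene ranking. One plausible reading, assumed here and not confirmed by the source, is to fit a ridge-regularized linear model from standardized gene expression to a +1/-1 class label and rank genes by absolute weight. All names and data below are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def standardize_columns(samples):
    """Give each gene zero mean and unit variance (constant genes are only centered)."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(d)]
    stds = [(sum((s[j] - means[j]) ** 2 for s in samples) / n) ** 0.5 or 1.0
            for j in range(d)]
    return [[(s[j] - means[j]) / stds[j] for j in range(d)] for s in samples]


def ridge_weights(samples, targets, lam):
    """Closed-form ridge solution w = (Z^T Z + lam I)^-1 Z^T t on standardized data."""
    z = standardize_columns(samples)
    d = len(z[0])
    t_mean = sum(targets) / len(targets)
    t = [v - t_mean for v in targets]
    gram = [[sum(r[i] * r[j] for r in z) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * v for r, v in zip(z, t)) for i in range(d)]
    return solve_linear_system(gram, rhs)


# Toy data (invented): rows are samples, columns are genes; +1 = ALL, -1 = AML.
samples = [
    [5.0, 1.0, 2.0],
    [5.5, 1.2, 3.0],
    [4.5, 0.9, 1.0],
    [1.0, 1.1, 2.5],
    [1.5, 1.0, 1.5],
    [0.5, 0.8, 2.0],
]
targets = [1, 1, 1, -1, -1, -1]
weights = ridge_weights(samples, targets, lam=1.0)
ranking = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
```

Unlike a per-gene filter score, the weights are fitted jointly, so a gene that only repeats information carried by another gene tends to receive a smaller weight.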

11
Feature selection experiments already done on
this micro-array data
  • The 3 genes we found:
  • Id 1882: CST3, Cystatin C (amyloid angiopathy and
    cerebral hemorrhage), M27891_at
  • Id 6201: INTERLEUKIN-8 PRECURSOR, Y00787_at
  • Id 4211: VIL2, Villin 2 (ezrin), X51521_at

12
Some analysis of the results we got
  • The first two genes are strongly correlated with
    each other.
  • The third gene is very different from the first
    two genes.
  • 1st gene + 2nd gene is bad (10/34 errors)
  • 1st gene + 3rd gene is good (1/34 error)
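
The point about correlated genes can be checked with a Pearson correlation between expression profiles: a pair with correlation near 1 carries largely the same information, so adding the second gene helps a classifier little. The profiles below are invented toy values, not the actual measurements of the three genes.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Hypothetical expression profiles across six samples.
gene1 = [5.0, 5.5, 4.5, 1.0, 1.5, 0.5]
gene2 = [4.8, 5.9, 4.1, 1.2, 1.3, 0.7]  # tracks gene1 closely
gene3 = [2.0, 0.5, 2.2, 0.4, 2.1, 0.6]  # follows a different pattern

r12 = pearson(gene1, gene2)  # close to 1: redundant pair
r13 = pearson(gene1, gene3)  # much weaker: complementary pair
```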

13
Question: As the next step, can we find more
gene-gene relationships?
  • Several techniques are available:
  • Clustering
  • Bayesian network learning
  • Independent component analysis

14
Clustering Analysis in micro-array data
  • Clustering methods have already been widely used
    to find similar genes or common binding sites
    from micro-array data.
  • Many different clustering algorithms exist:
  • Hierarchical clustering
  • K-means
  • SOM
  • CAST
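
As an illustration of one of the listed algorithms, here is a minimal k-means sketch that groups genes by the similarity of their expression profiles. It is a toy version under stated assumptions: squared Euclidean distance, naive initialization from the first k profiles, a fixed number of iterations, and invented data. It is not the setup used in the experiments.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=20):
    """Assign each point to one of k clusters; returns a list of cluster indices."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical gene profiles (rows = genes, columns = samples).
profiles = [
    [5.0, 5.5, 4.5, 1.0, 1.5, 0.5],  # gene A
    [1.0, 0.5, 1.2, 4.0, 4.5, 5.0],  # gene C: opposite pattern to A
    [4.8, 5.9, 4.1, 1.2, 1.3, 0.7],  # gene B: similar to A
    [0.8, 0.9, 1.0, 4.2, 4.4, 4.9],  # gene D: similar to C
]
clusters = k_means(profiles, k=2)
```

Genes A and B land in one cluster and genes C and D in the other; the result says which genes are similar but nothing about how they influence each other.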

15
An example of hierarchical clustering
analysis (from Spellman et al.)
16
Our clustering experiment on AML/ALL dataset
  • Our clustering was run over the 1000 genes
    most relevant to the disease.

17
The feature-selection curve
18
Our clustering result in the top 1000 genes
19
Some analysis of the clustering result
  • The first two genes are always placed in the
    same cluster (in hierarchical clustering, they are
    in cluster 1; in k-means clustering, they are in
    cluster 2).
  • The third gene is never placed in the same
    group as the first two genes (in hierarchical
    clustering, it is in cluster 23; in k-means
    clustering, it is in cluster 1).
  • This validates our previous analysis.

20
Disadvantages of Clustering
  • However:
  • It cannot reveal the internal relationships
    inside one cluster.
  • It cannot find the relationships between
    clusters.
  • Genes connected to each other may not be in the
    same cluster.
  • Clustering vs Bayesian network learning (copied
    from David K. Gifford, Science, vol. 293, Sept. 2001)

21
A counter example of clustering analysis
22
Bayesian network learning
  • Thus a Bayesian network seems a much better
    technique if we want to model the relationships
    among genes.
  • Researchers have done experiments and constructed
    Bayesian networks from micro-array data.
  • They found that a few genes have a lot
    of connections with other genes.
  • They used prior biology knowledge to validate
    their learned edges (interactions between genes)
    and found them reasonable.
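
For reference, a Bayesian network over genes represents the joint distribution as a product of per-node conditionals, each conditioned on the node's parents in the graph. The three-gene chain below is a hypothetical illustration, not a network from the cited work.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

% Hypothetical chain A -> B -> C:
P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid B)
```

Each learned edge therefore corresponds to a direct dependency between two genes, which is the kind of relationship clustering cannot express.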

23
An example of a Bayesian network
  • Part of the Bayesian network Nir Friedman
    constructed. There are 800 genes (nodes) in total
    in the graph. These 800 genes are all cell-cycle
    regulated genes.

24
(No Transcript)
25
Our plan in genetic regulatory network
construction
  • There are several possible ways:
  • Using feature selection techniques to make the
    network learning task more robust and less
    computationally costly.
  • Learning gene regulatory networks on microarray
    datasets with disease labels (thus we may find
    pathways relevant to a specific disease).
  • Using ICA to find hidden variables (hidden
    layers) and checking their consistency with the
    Bayes network learning result.

26
Our plan in genetic regulatory network
construction
  • Use prior biology knowledge about gene networks,
    such as network motifs. The following example
    is copied from Shai S. Shen-Orr, Nature
    Genetics, 2002. Previous network learning
    algorithms have not considered these characteristics.

27
(No Transcript)
28
References
  • Using Bayesian networks to analyze expression data.
    Friedman, N., Linial, M., Nachman, I. Journal of
    Computational Biology, 7:601-620, 2000.
  • Gene selection for cancer classification using
    support vector machines. Guyon, I. et al. Machine
    Learning, 46:389-422.
  • Cluster analysis and display of genome-wide
    expression patterns. Eisen, M.B. et al. PNAS,
    95:14863-14868, 1998.
  • Clustering gene expression patterns. Ben-Dor,
    A., Shamir, R., and Yakhini, Z. Journal of Computational
    Biology, 6(3/4):281-297, 1999.