Context-Specific Bayesian Clustering for Gene Expression Data

1
Context-Specific Bayesian Clustering for Gene
Expression Data
  • Yoseph Barash and Nir Friedman
  • RECOMB 2001
  • Talk by Kyu-Baek Hwang

2
Abstract
  • Clustering of genes by their expression levels
    and their putative TF binding sites
  • Extended naïve Bayes classifier
  • Context-specific independencies
  • Structural EM algorithm
  • Experiments on the yeast dataset

3
Introduction
  • A central goal of molecular biology
  • To understand the regulation of protein synthesis
  • New technologies
  • DNA sequencing (promoter regions that contain the
    binding sites of transcription factors)
  • DNA microarrays
  • The hypothesis
  • Genes with a common functional role have similar
    expression patterns across different experiments
    and similar binding sites
  • Clustering of genes

4
Naïve Bayesian Clustering
  • Random variables X1, ..., XN
  • The expression levels of a particular gene
    through all experiments
  • The number of occurrences of each binding site
    in the promoter region
  • A joint probability distribution
  • P(X1, ..., XN)
  • A dataset D sampled from P(X1, ..., XN)
  • A probabilistic model that represents
    P(X1, ..., XN) efficiently

5
Naïve Bayesian Clustering (Cont'd)
  • Naïve Bayes assumption
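
The formula for this slide is not in the transcript. In its standard form, the naïve Bayes assumption says that the variables are independent of one another given a hidden cluster variable C, so the joint distribution is a mixture over clusters. The symbols C and K (the number of clusters) are notation assumed here:

```latex
P(X_1, \ldots, X_N) = \sum_{c=1}^{K} P(C = c) \prod_{i=1}^{N} P(X_i \mid C = c)
```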

6
Naïve Bayesian Clustering (Cont'd)
  • For conditionals
  • The probability of an example belonging to a
    cluster
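
The probability of an example belonging to a cluster follows from Bayes' rule: P(C = c | x) is proportional to P(C = c) times the product of the P(x_i | C = c). The following is a minimal sketch, assuming discrete variables only and made-up probability tables; the function name and the table layout are hypothetical, not taken from the talk.

```python
import math

def cluster_posterior(x, prior, cond):
    """Return P(C = c | x) for every cluster c under a naive Bayes model.

    x     : list of discrete values, one per variable X_i
    prior : prior[c] = P(C = c)
    cond  : cond[c][i][v] = P(X_i = v | C = c)
    """
    # Work in log space so products over many variables do not underflow.
    log_joint = [
        math.log(prior[c]) + sum(math.log(cond[c][i][v]) for i, v in enumerate(x))
        for c in range(len(prior))
    ]
    # Subtract the maximum before exponentiating (log-sum-exp trick).
    m = max(log_joint)
    weights = [math.exp(lj - m) for lj in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: two clusters, two binary variables (all numbers are made up).
prior = [0.5, 0.5]
cond = [
    [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}],  # cluster 0
    [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],  # cluster 1
]
posterior = cluster_posterior([1, 1], prior, cond)
```

Continuous expression levels would need a density such as a Gaussian in place of the lookup tables; that part is left out of the sketch.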

7
Selective Naïve Bayesian Models
  • Not all of the variables X1, ..., XN depend on
    the cluster.
  • The simple naïve model is vulnerable to noisy
    samples.
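
One common way to write a selective naïve Bayes model is to let only a subset of the variables depend on the cluster, while the rest keep a single cluster-independent distribution. The index set G below is a symbol assumed for illustration:

```latex
P(X_1, \ldots, X_N, C) = P(C) \prod_{i \in G} P(X_i \mid C) \prod_{i \notin G} P(X_i)
```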

8
Context-Specific Independence
  • If a certain binding site Xi regulates genes
    in only two clusters,
  • P(Xi | C) is represented by separate
    distributions for those two clusters and one
    shared distribution for all other clusters
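
The idea of context-specific independence can be sketched as a lookup with a default: clusters named in the local structure get their own distribution, and every other cluster shares one. The function name, the dictionaries, and all numbers below are hypothetical.

```python
def csi_conditional(cluster, specific, default):
    """Return the distribution P(X_i | C = cluster) under a CSI local structure.

    specific : dict mapping each cluster in the local structure to its own distribution
    default  : single distribution shared by every other cluster
    """
    return specific.get(cluster, default)

# Hypothetical binding site X_i that matters only in clusters 1 and 3 (of K = 5).
specific = {1: {0: 0.2, 1: 0.8}, 3: {0: 0.4, 1: 0.6}}
default = {0: 0.95, 1: 0.05}

# Expanding the structure gives the full conditional table, one row per cluster.
table = [csi_conditional(c, specific, default) for c in range(5)]

# Only three distributions have to be estimated instead of five.
n_distributions = len(specific) + 1
```

Fewer distinct distributions mean fewer parameters to estimate, which is the practical benefit of this representation.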

9
Learning of Bayesian Clustering Models
  • The Bayesian clustering model in this paper is a
    subclass of Bayesian networks.
  • The cluster label is missing.
  • Structural EM algorithm with specified size of
    clusters

10
The Bayesian Score
  • A model M = <K, Li>
  • Model parameters
  • The Bayesian approach
  • Model prior

11
The Bayesian Score (Cont'd)
  • The likelihood
  • Decomposable conjugate priors
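
The formulas for the Bayesian score are not in the transcript. In its usual form, the score of a model is its log posterior up to a constant, and the likelihood term integrates over the parameters. The symbol θ for the model parameters is assumed notation:

```latex
\mathrm{score}(M : D) = \log P(M) + \log P(D \mid M), \qquad
P(D \mid M) = \int P(D \mid M, \theta)\, P(\theta \mid M)\, d\theta
```

With decomposable conjugate priors, the integral factorizes into independent terms that each have a closed form when the data are complete.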

12
For Complete Data
  • If we know the number of clusters and to which
    cluster an example belongs,
  • The marginal likelihood
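
For complete data, each local term of the marginal likelihood has a closed form. The sketch below shows the standard Dirichlet-multinomial term for one discrete variable within one cluster; the function name is hypothetical, and the slide's own formula may be written differently.

```python
from math import lgamma, log

def log_marginal_likelihood(counts, alpha):
    """Log marginal likelihood of a discrete sample under a Dirichlet prior.

    counts : counts[v] = number of times value v was observed
    alpha  : alpha[v]  = Dirichlet hyperparameter for value v
    """
    n, a = sum(counts), sum(alpha)
    # Ratio of normalizing constants: Gamma(a) / Gamma(a + n).
    result = lgamma(a) - lgamma(a + n)
    # One Gamma ratio per value: Gamma(alpha_v + n_v) / Gamma(alpha_v).
    for n_v, a_v in zip(counts, alpha):
        result += lgamma(a_v + n_v) - lgamma(a_v)
    return result

# Two observations, one of each value, under a uniform prior: probability 1/6.
example = log_marginal_likelihood([1, 1], [1.0, 1.0])
```

The full score would sum such terms over all variables and parameter groups, with an analogous conjugate term for the continuous expression variables.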

13
For Incomplete Data
  • Approximation to the marginal likelihood
  • The Bayesian information criterion (BIC)
  • The Cheeseman-Stutz (CS) score
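
For reference, the two approximations are usually written as follows. Here θ̂ is the parameter estimate found by EM, dim(M) is the number of free parameters, |D| is the number of examples, and D* is the completed dataset built from expected sufficient statistics under θ̂. All of these symbols are assumed notation rather than the slide's own:

```latex
\begin{align}
\mathrm{BIC}(M : D) &= \log P(D \mid M, \hat{\theta}) - \frac{\dim(M)}{2} \log |D| \\
\mathrm{CS}(M : D)  &= \log P(D^{*} \mid M) + \log P(D \mid M, \hat{\theta}) - \log P(D^{*} \mid M, \hat{\theta})
\end{align}
```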

14
Learning CSI Clustering
  • The search space is huge.
  • Bayesian structural EM algorithm

15
Structural EM Procedure
  • Initialization
  • The full model
  • The random model
  • To escape local maxima
  • A random first-ascent hill-climb search
  • Restart the structural EM algorithm and find a
    local maximum again
  • If the score of this model is better than the
    former, restart the structural EM algorithm
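
The escape-and-restart loop can be sketched on a toy problem. This is not the structural EM algorithm itself: `hill_climb` stands in for one run of structural EM, `perturb` stands in for the random first-ascent step, and the one-dimensional `landscape` is made up for illustration.

```python
import random

def hill_climb(x, score, neighbors):
    """Greedy ascent: move to the best neighbor until none improves the score."""
    while True:
        best = max(neighbors(x), key=score)
        if score(best) <= score(x):
            return x
        x = best

def search_with_restarts(x0, score, neighbors, perturb, max_failures=20, seed=0):
    """Repeatedly perturb the current local maximum and climb again.

    A candidate replaces the current solution only if it scores strictly higher;
    the search stops after max_failures consecutive non-improving attempts.
    """
    rng = random.Random(seed)
    current = hill_climb(x0, score, neighbors)
    failures = 0
    while failures < max_failures:
        candidate = hill_climb(perturb(current, rng), score, neighbors)
        if score(candidate) > score(current):
            current, failures = candidate, 0
        else:
            failures += 1
    return current

# Toy landscape with a local maximum at index 3 and a higher one at index 15.
landscape = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4]
score = lambda x: landscape[x]
neighbors = lambda x: [max(0, x - 1), min(len(landscape) - 1, x + 1)]
perturb = lambda x, rng: min(len(landscape) - 1, max(0, x + rng.randint(-8, 8)))

result = search_with_restarts(0, score, neighbors, perturb)
```

Accepting a candidate only on strict improvement guarantees that the score never decreases, so the loop terminates.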

16
Simulation Studies
  • The synthetic dataset
  • Five clusters
  • 50 continuous variables and 100 discrete
    variables
  • Training sets of size 200, 500, and 800
  • A test set of size 2000
  • Additional noise (10% and 30% of genes)
  • The number of clusters ranging from 3 to 7 (K
    values)

17
Result on the Simulation Studies
  • Cluster number
  • Models with fewer clusters were sharply
    penalized.
  • Structure accuracy
  • Low false-negative rate
  • Mutual information ratio I(Ct, Ce)
  • With more training data, there was more
    information gain.
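
A mutual information ratio between the true cluster labels Ct and the estimated labels Ce can be computed from label counts. The sketch below normalizes I(Ct; Ce) by the entropy H(Ct); that normalization is an assumption, since the transcript does not define the ratio.

```python
from collections import Counter
from math import log

def mutual_information(a, b):
    """Empirical mutual information (in nats) between two label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # p(x, y) / (p(x) p(y)) simplifies to n_xy * n / (n_x * n_y).
    return sum(
        (n_xy / n) * log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in count_ab.items()
    )

def information_ratio(true_labels, est_labels):
    """I(Ct; Ce) / H(Ct): share of the true-label entropy captured by the estimate."""
    h_true = mutual_information(true_labels, true_labels)  # I(C; C) = H(C)
    return mutual_information(true_labels, est_labels) / h_true

# A relabeled but otherwise perfect clustering recovers all the information.
perfect = information_ratio([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that is independent of the truth recovers none.
unrelated = information_ratio([0, 0, 1, 1], [0, 1, 0, 1])
```

The ratio is undefined when all true labels are identical, because H(Ct) is then zero.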

18
Biological Data
  • Budding yeast gene expression data
  • Spellman's cell-cycle data (800 genes and 77
    experiments)
  • Gasch's stress data (1271 genes and 92
    experiments)
  • The number of putative binding sites in the
    1000 bp upstream of the ORF (from SCPD and
    TRANSFAC) for each gene
  • Using the MatInspector program
  • The whole upstream region and four sub-regions
  • Initialization point
  • k-means clustering with only gene expression data
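
An initialization of this kind can be sketched with plain k-means (Lloyd's algorithm) on expression profiles. The function below is a minimal illustration with made-up two-dimensional profiles and caller-supplied starting centers; it is not the code used for the experiments.

```python
def kmeans(points, init_centers, n_iter=50):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    centers = [tuple(c) for c in init_centers]
    k = len(centers)
    labels = []
    for _ in range(n_iter):
        # Assign every point to the center with the smallest squared distance.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Toy expression profiles: two well-separated groups of three genes each.
profiles = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, centers = kmeans(profiles, [profiles[0], profiles[3]])
```

The resulting labels would then serve as the starting cluster assignment for the EM-based learning.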

19
Clusterings for Two Datasets

20
Clusters for the Stress Data

21
Clusters for the Cell-Cycle Data

22
Discussion
  • Bayesian clustering
  • Two heterogeneous data sources
  • Bayesian networks