Title: Context-Specific Bayesian Clustering for Gene Expression Data
1Context-Specific Bayesian Clustering for Gene
Expression Data
- Yoseph Barash, Nir Friedman
- School of Computer Science & Engineering
- Hebrew University
2Introduction
- New experimental methods → abundance of data
- Gene Expression
- Genomic sequences
- Protein levels
- Data analysis methods are crucial for
understanding such data
- Clustering serves as a tool for organizing the data
and finding patterns in it
3This Talk
- New method for clustering
- Combines different types of data
- Emphasis on learning context-specific description
of the clusters
- Application to gene expression data
- Combine expression data with genomic information
4The Data
[Figure: data matrix; labels: Experiments, Binding Sites, Genes (index i)]
- Goal
- Understand interactions between transcription factors (TFs)
and expression levels
5Simple Clustering Model
- Attributes are independent given the cluster
- Simple model → computationally cheap
- Genes are clustered according to both expression
levels and binding sites
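
As a rough illustration of the independence assumption above, the sketch below computes the posterior over clusters for one gene. It assumes Gaussian expression attributes and binary binding-site attributes; the function names, the `model` layout, and all numbers are hypothetical and are not taken from the talk.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def cluster_posterior(expr, sites, model):
    """Return P(cluster | gene), treating attributes as independent given the cluster."""
    log_joint = []
    for c in model:
        lp = math.log(c["prior"])
        # Expression levels: one Gaussian per experiment (assumed family).
        for x, (mean, var) in zip(expr, c["gauss"]):
            lp += log_gaussian(x, mean, var)
        # Binding sites: one Bernoulli per putative site (assumed family).
        for s, p in zip(sites, c["site_prob"]):
            lp += math.log(p if s else 1.0 - p)
        log_joint.append(lp)
    # Normalize in log space for numerical stability.
    m = max(log_joint)
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-cluster model: two experiments, one binding site.
model = [
    {"prior": 0.5, "gauss": [(2.0, 1.0), (1.5, 1.0)], "site_prob": [0.9]},
    {"prior": 0.5, "gauss": [(-2.0, 1.0), (-1.5, 1.0)], "site_prob": [0.1]},
]
post = cluster_posterior([1.8, 1.2], [1], model)
```

Because the log-probabilities simply add up, both data types contribute to the same assignment, which is why the model stays cheap to evaluate.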
6Local Probability Models
[Figure; labels: TF1, TF2, Multinomial, Gaussian]
7Structure in Local Probability Models
[Figure; labels: TF1, TF2]
8Context Specific Independence
- Benefits
- Identifies what features characterize each
cluster
- Reduces bias during learning
- A compact and efficient representation
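
One way to picture the compact representation is a local model in which an attribute has its own parameter only in the clusters it characterizes, and a single shared default everywhere else. The sketch below assumes that layout; the names `specific` and `default`, the cluster count, and the probabilities are all illustrative assumptions.

```python
def site_probability(cluster, specific, default):
    """P(binding site present | cluster): cluster-specific if listed, else the shared default."""
    return specific.get(cluster, default)

def num_parameters(specific, k):
    """Free parameters for one binary attribute: one per listed cluster,
    plus one shared default if any of the k clusters falls back to it."""
    return len(specific) + (1 if len(specific) < k else 0)

# Hypothetical attribute TF1 over 5 clusters: only cluster 3 is distinguished.
k = 5
tf1_specific = {3: 0.80}
tf1_default = 0.05
table = {c: site_probability(c, tf1_specific, tf1_default) for c in range(1, k + 1)}
```

Here a full table would need 5 parameters, while the context-specific form needs 2, and the listed clusters directly name where the attribute is informative.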
9Scoring CSI Cluster Models
- Represent conditional probabilities with
different parametric families
- Gaussian
- Multinomial
- Poisson
- Choose parameter priors from the appropriate
conjugate prior families
- Score: [equation not recovered], with its two terms
labeled Marginal Likelihood and Prior
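
The slide labels the two terms of the score as a marginal likelihood and a prior. A common form of such a Bayesian score, given here as an assumption that may differ from the authors' exact definition, is the log posterior of the model structure $M$ up to a constant:

```latex
\mathrm{Score}(M : D) = \log P(D \mid M) + \log P(M),
\qquad
P(D \mid M) = \int P(D \mid M, \theta)\, P(\theta \mid M)\, d\theta
```

With a conjugate prior and complete data, the integral has a closed form. For example, for a multinomial with counts $N_1, \dots, N_r$ (total $N$) and a Dirichlet prior with hyperparameters $\alpha_1, \dots, \alpha_r$ (total $\alpha$):

```latex
P(D \mid M) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)}
\prod_{k=1}^{r} \frac{\Gamma(\alpha_k + N_k)}{\Gamma(\alpha_k)}
```

When the cluster variable is hidden, the data are incomplete and this closed form no longer applies directly, so the marginal likelihood has to be approximated.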
10Learning Structure: Naive Approach
Basic problem: efficiency
11Learning Structure: Structural EM
- We can evaluate each edge's parameters separately
given complete data
- For MAP, we compute EM only once for each iteration
- Guaranteed to converge to a local optimum
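
The point that each edge can be evaluated separately once the data are completed can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes binary attributes, a Beta(1, 1) prior, and an all-or-nothing choice per attribute (depends on the cluster or not) rather than full context-specific structures, and it shows only the structure-selection step, taking the soft cluster assignments `resp` as given. All names are hypothetical.

```python
import math

def log_ml(n1, n0, a=1.0):
    """Log marginal likelihood of (possibly fractional) binary counts
    under a Beta(a, a) prior."""
    return (math.lgamma(2 * a) - math.lgamma(2 * a + n1 + n0)
            + math.lgamma(a + n1) + math.lgamma(a + n0)
            - 2 * math.lgamma(a))

def structure_step(data, resp):
    """Decide, independently for each attribute, whether it depends on the cluster.

    data: list of binary attribute vectors, one per gene.
    resp: soft cluster assignments, one row per gene; each row sums to 1.
    """
    n, d, k = len(data), len(data[0]), len(resp[0])
    # Expected cluster sizes, computed once and reused for every attribute.
    size = [sum(r[c] for r in resp) for c in range(k)]
    edges = []
    for j in range(d):
        # Expected number of genes in each cluster with attribute j present.
        ones = [sum(r[c] * x[j] for r, x in zip(resp, data)) for c in range(k)]
        with_edge = sum(log_ml(ones[c], size[c] - ones[c]) for c in range(k))
        total = sum(ones)
        without_edge = log_ml(total, n - total)
        edges.append(with_edge > without_edge)
    return edges

# Toy data: attribute 0 tracks the cluster, attribute 1 does not.
data = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
resp = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
edges = structure_step(data, resp)
```

In a full Structural EM loop, a step like this would alternate with recomputing `resp` under the newly chosen structure and parameters.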
12Results on Synthetic Data
- Basic approach
- Generate data from a known structure
- Evaluate learned structures for different sample
numbers (200-800)
- Add noise of unrelated samples to the training
set to simulate genes that do not fall into
nice functional categories (10-30)
- Test the learned model for structure, as well as for
correlation between its tagging and the one
given by the original model
- Main results
- Cluster number: models with fewer clusters were
sharply penalized. Models with 1-2 additional
clusters often got similar scores, with
degenerate clusters to which none of the real
samples were classified.
- Structure accuracy: very few false negative edges,
10-20 false positive edges (score dependent)
- Mutual information ratio: max for 800 samples,
100-95 for 500 samples, and 90 for 200 samples
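
The comparison between the learned tagging and the original one can be illustrated with a mutual information ratio. The slides do not define the ratio, so the sketch below assumes one plausible normalization: the mutual information between the two labelings divided by the entropy of the original labels. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in nats) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same samples."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def mi_ratio(true_labels, learned_labels):
    """Assumed normalization: I(true; learned) / H(true)."""
    return mutual_information(true_labels, learned_labels) / entropy(true_labels)

# A relabeled but otherwise identical clustering recovers all the information.
same = mi_ratio([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering independent of the true one recovers none.
unrelated = mi_ratio([0, 0, 1, 1], [0, 1, 0, 1])
```

A measure of this kind is insensitive to how cluster indices are numbered, which matters because a learned model has no reason to number its clusters like the generating model.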
13Yeast Stress Data (Gasch et al 2001)
- Examines response of yeast to stress situations
- Total 93 arrays
- We selected 900 genes that changed in a
selective manner
- Treatment steps
- Initial clustering
- Found putative binding sites based on clusters
- Re-clustered with these sites
14Stress Data -- CSI Clusters
15CSI Clusters
16Promoters Analysis
- Cluster 3
- MIG1: CCCCGC, CGGACC, ACCCCG
- GAL4: CGGGCC
- Others: CCAATCA
17Promoters Analysis
- Cluster 7
- GCN4: TGACTCA
- Others: CGGAAAA, ACTGTGG
18Discussion
- Goals
- Identify binding sites/transcription factors
- Understand interactions among transcription
factors
- Combinatorial effects on expression
- Predict role/function of the genes
- Methods
- Integration of a model of statistical patterns of
binding sites (see Holmes & Bruno, ISMB'00)
- Additional dependencies among attributes
- Tree augmented Naive Bayes
- Probabilistic Relational Models (see poster)