Title: Transcriptional Diagnosis by Bayesian Network
1 Transcriptional Diagnosis by Bayesian Network
- Hsun-Hsien Chang and Marco F. Ramoni
Children's Hospital Informatics Program
Harvard-MIT Division of Health Sciences and Technology
Harvard Medical School
March 17, 2009
2 Background
- Microarray technology enables profiling the expression of thousands of genes in parallel on a single chip.
- Comparative analysis of gene expression across tissue states extracts signature genes for disease diagnosis.
- Challenge
  - The number of variables (i.e., genes) is much greater than the number of observations (i.e., biological samples), inducing the problem of overfitting.
- Existing methods
  - Gene selection: compute statistics (e.g., t-statistics, SNR, PCA) of individual genes and select high-ranking genes.
  - Classification model: create a classification function of the selected genes.
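The conventional gene-selection step described above can be sketched as follows. This is a minimal illustration, assuming Welch's t-statistic as the per-gene score and a fixed `top_k` cut-off; the function names and the toy expression values are made up for the example.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_genes(expr, labels, top_k):
    """Score each gene independently by |t| and keep the top_k genes.

    expr:   dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels aligned with the samples
    """
    scores = {}
    for gene, values in expr.items():
        g0 = [v for v, y in zip(values, labels) if y == 0]
        g1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(g0, g1))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Toy data (made up): 3 genes, 6 samples; the first three samples are class 0.
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # strongly differential
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # flat
    "geneC": [0.5, 1.5, 1.0, 1.6, 2.4, 2.0],  # weakly differential
}
labels = [0, 0, 0, 1, 1, 1]
selected, scores = rank_genes(expr, labels, top_k=2)
print(selected)
```

Each gene is scored in isolation and `top_k` has to be chosen by hand.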
3 Proposed Approach
- Issues
  - The assumption that genes are independent is inadequate.
  - Other genes may be collinearly expressed with the signature.
  - Selection and classification are two non-integrated steps; a cut-off threshold is needed to select high-ranking genes.
- Proposed strategies
  - Adopt a systems biology approach to infer the functional dependence among genes.
  - Use the dependence network for tissue discrimination.
  - Integrate gene selection and the classification model in a Bayesian network framework.
4 Data Representation by Bayesian Network
- Bayesian networks are directed acyclic graphs where
  - Nodes correspond to random variables.
  - Directed arcs encode conditional probabilities of the target nodes given the source nodes.
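To make the representation concrete, here is a minimal sketch of a three-node discrete network and its joint probability, which factorizes into one conditional probability per node given its parents. The node names, the chain structure, and all probability values are hypothetical.

```python
from itertools import product

# Hypothetical chain: phenotype -> geneX -> geneY (keys are nodes, values are parents).
parents = {"phenotype": [], "geneX": ["phenotype"], "geneY": ["geneX"]}

# Conditional probability tables: cpt[node][tuple of parent values][node value].
cpt = {
    "phenotype": {(): {"AC": 0.5, "SCC": 0.5}},
    "geneX": {("AC",): {"low": 0.8, "high": 0.2},
              ("SCC",): {"low": 0.1, "high": 0.9}},
    "geneY": {("low",): {"low": 0.7, "high": 0.3},
              ("high",): {"low": 0.2, "high": 0.8}},
}


def joint_probability(assignment):
    """P(all nodes) = product over nodes of P(node | its parents)."""
    p = 1.0
    for node, pa in parents.items():
        key = tuple(assignment[q] for q in pa)
        p *= cpt[node][key][assignment[node]]
    return p


p = joint_probability({"phenotype": "SCC", "geneX": "high", "geneY": "high"})
print(round(p, 3))  # 0.5 * 0.9 * 0.8 = 0.36

# Sanity check: the joint distribution sums to 1 over all assignments.
total = sum(
    joint_probability({"phenotype": ph, "geneX": x, "geneY": y})
    for ph, x, y in product(["AC", "SCC"], ["low", "high"], ["low", "high"])
)
print(round(total, 6))
```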
5 Gene Selection by Bayes Factor
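As a rough illustration of selecting genes by Bayes factor, the sketch below compares two models for one gene: its expression depends on the phenotype, or it does not. It assumes expression is discretized into levels and that each distribution has a symmetric Dirichlet prior. These modeling choices, the function names, and the counts are illustrative assumptions, not details taken from the talk.

```python
from math import lgamma


def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of category counts under a symmetric Dirichlet(alpha) prior."""
    k, n = len(counts), sum(counts)
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        out += lgamma(alpha + c) - lgamma(alpha)
    return out


def log_bayes_factor(table, alpha=1.0):
    """Log Bayes factor of 'gene depends on phenotype' over 'gene is independent'.

    table[c] holds the counts of each expression level within phenotype class c.
    """
    # Dependent model: a separate expression distribution per phenotype class.
    dependent = sum(log_marginal(row, alpha) for row in table)
    # Independent model: one distribution shared by all samples.
    pooled = [sum(col) for col in zip(*table)]
    independent = log_marginal(pooled, alpha)
    return dependent - independent


# Made-up counts of (low, high) expression in two phenotype classes.
informative = [[18, 2], [3, 17]]
uninformative = [[10, 10], [11, 9]]
print(log_bayes_factor(informative) > 0)    # evidence favors dependence
print(log_bayes_factor(uninformative) < 0)  # evidence favors independence
```

Under this sketch, a gene would be kept when its log Bayes factor is positive, so no hand-picked rank cut-off is required.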
6 Collinearity Elimination via Network Learning
7 Sample Classification
- The phenotype variable is independent of the blue genes, given the green genes.
- Technically, the green genes form the Markov blanket of the phenotype variable, and they are the signature genes used for phenotype determination.
- Tissue classification
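The Markov blanket of a node in a directed acyclic graph consists of its parents, its children, and the other parents of its children. A minimal sketch of reading it off a network, using a hypothetical structure and hypothetical gene names:

```python
def markov_blanket(parents, target):
    """Return the parents, children, and children's other parents of `target`.

    parents: dict mapping each node to the list of its parent nodes
    """
    children = [node for node, pa in parents.items() if target in pa]
    blanket = set(parents[target]) | set(children)
    for child in children:
        blanket |= set(parents[child])  # co-parents of each child
    blanket.discard(target)
    return blanket


# Hypothetical network: keys are nodes, values are their parents.
parents = {
    "phenotype": [],
    "g1": ["phenotype"],
    "g2": ["phenotype", "g3"],
    "g3": [],
    "g4": ["g1"],
    "g5": ["g2"],
}
blanket = markov_blanket(parents, "phenotype")
print(sorted(blanket))  # ['g1', 'g2', 'g3']
```

In this toy network, `g4` and `g5` play the role of the blue genes: once `g1`, `g2`, and `g3` are known, they carry no further information about the phenotype.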
8 Algorithm Summary
Gene Selection by Bayes Factor
Collinearity Elimination
Sample Classification
(sensitivity analysis)
9 Discriminate Lung Carcinoma Subtypes
- Adenocarcinoma (AC) and squamous cell carcinoma (SCC) are major subtypes of lung cancer.
- AC and SCC are distinct in survival, chances of metastasis, and responses to chemotherapy and targeted therapy.
- Physicians lack confidence in correct recognition when there are multiple primary carcinomas.
- Training
  - 58 ACs and 53 SCCs.
  - 77 genes selected in the network.
  - 25 signature genes.
10 Bayesian Network for Lung Carcinoma
11 Large-Scale Testing on Independent Samples
- 422 samples (232 ACs and 190 SCCs) aggregated from 7 cohorts (including Caucasians, African-Americans, Chinese).
- Accuracy: 95.2% AUROC.
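For reference, AUROC equals the probability that a randomly chosen sample of the positive class receives a higher classifier score than a randomly chosen sample of the negative class, with ties counted as one half. A minimal sketch with made-up scores and labels (not the study data):

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up classifier scores; label 1 marks the positive class.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.1]
labels = [1, 1, 1, 0, 0, 0]
value = auroc(scores, labels)
print(round(value, 3))  # 7 of the 9 positive/negative pairs are ordered correctly
```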
12 Comparisons with Other Popular Methods
- Higher classification accuracy.
- Small-sized signature to avoid overfitting.
13 KRT6 Family Characterizes the Lung Carcinoma Discrimination
14 KRT6 Family Characterizes the Lung Carcinoma Discrimination
- Keratin-6 family genes (KRT6A, KRT6B, KRT6C) are important for distinguishing lung cancer subtypes.
  - Accounting for 95% of the accuracy of the whole 25-gene signature.
  - Located on chromosome 12q12-q13.
- A nonlinear, concave discriminative surface.
15 Verification by Chr12q12-q13 Aberrations
- Investigate DNA copy number changes using comparative genomic hybridization (CGH) arrays.
  - 12 ACs and 13 SCCs from Vrije University Medical Center, the Netherlands.
- A dumbbell discriminative surface achieves 80% classification accuracy.
  - Treat the average CGH values of genes occupying q12, q13, and q12-13, respectively, as three features to construct a Naïve Bayes classifier.
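A Naïve Bayes classifier over three features can be sketched as below. The sketch assumes each feature is Gaussian within each class, which the talk does not specify. The function names, the feature values, and the direction of the copy-number difference between classes are all made up for illustration.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        cols = list(zip(*rows))
        model[c] = {
            "prior": len(rows) / len(X),
            "mean": [mean(col) for col in cols],
            # Small constant keeps the variance strictly positive.
            "var": [pvariance(col) + 1e-6 for col in cols],
        }
    return model


def predict(model, x):
    """Return the class with the highest log posterior, treating features as independent."""
    def log_posterior(params):
        lp = math.log(params["prior"])
        for v, m, s2 in zip(x, params["mean"], params["var"]):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return lp
    return max(model, key=lambda c: log_posterior(model[c]))


# Made-up average CGH values for the three features (q12, q13, q12-13).
X = [[0.0, 0.1, 0.05], [0.1, 0.0, 0.0], [-0.1, 0.05, 0.1],
     [0.6, 0.7, 0.65], [0.5, 0.8, 0.7], [0.7, 0.6, 0.6]]
y = ["AC", "AC", "AC", "SCC", "SCC", "SCC"]
model = fit_gaussian_nb(X, y)
print(predict(model, [0.55, 0.65, 0.6]))
print(predict(model, [0.0, 0.0, 0.05]))
```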
16 Conclusion
- Reverse engineer regulatory network information for tissue classification.
  - Adopt the systems biology approach to infer the gene dependency network.
  - Select genes by Bayes factor.
  - Eliminate collinearity via network learning.
  - Integrate gene selection and the classification model in a single Bayesian network framework.
- Demonstrate the promising translational value of the systems biology approach in clinical study.