Title: Emerging causal inference problems in molecular systems biology
1Emerging causal inference problems in molecular
systems biology
- Yi Liu, Ph.D.
- Beijing Jiaotong University
- The presented work was mainly done in collaboration with
- Prof. Jing-Dong Jackie Han, Dr. Nan Qiao, Dr. Wei
Zhang at the CAS-Max Planck Partner Institute for
Computational Biology
- Prof. Min Liu, Dr. Jine Li at the Institute of
Genetics and Developmental Biology, CAS
2Outline
- Background
- Mining biological knowledge from the big data
generated by the Next Generation Sequencing (NGS)
Technology
- Examples of causal inference problems in biology
- 1) Inferring causal relationships between
transcription factors, epigenetic modifications
and gene expression level from heterogeneous deep
sequencing data sets
- 2) Reverse-engineering the yeast genetic
regulatory network from deletion-mutant gene
expression data
- 3) Discovering subtypes of ovarian cancer and
uncovering key molecular signatures that
distinguish these subtypes
3The need for integrating heterogeneous functional
genomic data sets
Yi Liu and Jing-Dong J. Han. Application of
Bayesian networks on large-scale biological data.
Frontiers in Biology, 2010, 5(2): 98-104.
4SeqSpider: A new Bayesian network inference
algorithm enabling integrative analysis of deep
sequencing data
Y Liu, N Qiao et al., Cell Research (2013)
Thanks to Prof. Jing-Dong Han for contributing to
the slides on this topic.
5Limitation of traditional BN learning approaches
In traditional BN structure learning approaches,
each node must take a discrete value. The only
exception is the linear-Gaussian case, but this
parameterization is still very restrictive.
6Profiled signature of deep sequencing data
H3K4me3 profile
Deep sequencing data have distinctive profile
signatures along the chromosomes, especially at
the gene promoter regions. However, traditional
BN learning algorithms have no way to use this
information.
mRNA profile
Liu et al, Nucleic Acids Res, 2010
7Profiles of hESC regulators around TSSs
In this work, we infer causal relationships
between transcription factors, epigenetic
modifications and gene expression level in
human/mouse embryonic stem cells.
8Heterogeneous data types in systems biology
Dataset type | Details | Data type | Cell line | Labs/Organizations
DNA methylation | DNA methylation | vector real value | hES, H1 | University of California, San Diego
Histone modifications | H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9ac, H3K9me3 | vector real value | hES, H1 | University of California, San Diego
Gene expression | RNA-seq data | real value | hES, H1 | University of California, San Diego
Transcription factor | OCT4, KLF, MYC, TAFII, P300, SOX2, NANOG | vector real value | hES, H1 | Ludwig Institute for Cancer Research
PRC complex | EZH2 and RING1B | vector real value | hES, H9 | Broad Institute of MIT and Harvard
More seriously, a single systems biology
investigation may involve heterogeneous data
types. Handling multiple data types
simultaneously in BN structure learning is not a
trivial task.
9Kernel-based surrogate dependency measures
In this work, we use the Kernel Generalized
Variance (F. Bach, JMLR 2002) to quantify the
joint dependence between heterogeneous variables.
It replaces the mutual information-like measures
used in BN learning.
10Kernels for heterogeneous types of data
Because the Kernel Generalized Variance (F. Bach,
JMLR 2002) quantifies the joint dependence between
heterogeneous variables, we only need to define
a kernel for each type of data.
Discrete Data
Real-valued Data
For vector-valued (profile) data, we define the L1-RPS kernel.
11The L1-RPS kernel
12The L1-RPS kernel
13Motivation of the L1-RPS kernel
Bin-to-bin distances (such as Euclidean) are not
ideal for measuring the discrepancy between
two sequence tag profiles. The Earth Mover's
Distance (EMD) computes the minimum mass
transportation effort needed to deform one
profile into another.
The L1-RPS distance is equivalent to EMD when the
two profiles have equal mass. In other cases, it
also quantifies the total mass difference
between the two profiles, whereas EMD does not.
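The contrast between a bin-to-bin distance and a transport-style distance can be illustrated with a small sketch. It assumes, as an interpretation not stated on the slide, that the L1-RPS distance is the L1 distance between the running partial sums (cumulative sums) of two equal-length profiles. The function names and toy profiles below are hypothetical.

```python
from itertools import accumulate


def l1_rps_distance(p, q):
    """L1 distance between the running partial sums of two profiles.

    Assumption: this is how the L1-RPS distance is interpreted here;
    the slides do not give the formula.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of bins")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def euclidean_distance(p, q):
    """Ordinary bin-to-bin Euclidean distance, for comparison."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


# Toy tag-count profiles: the same peak placed at three positions.
peak = [0, 4, 0, 0, 0, 0]
near_shift = [0, 0, 4, 0, 0, 0]   # peak moved by 1 bin
far_shift = [0, 0, 0, 0, 0, 4]    # peak moved by 4 bins
half_mass = [0, 2, 0, 0, 0, 0]    # same position, half the mass

# Euclidean distance cannot tell a small shift from a large one.
print(euclidean_distance(peak, near_shift), euclidean_distance(peak, far_shift))
# The cumulative-sum distance grows with the shift (mass x bins moved).
print(l1_rps_distance(peak, near_shift), l1_rps_distance(peak, far_shift))
# With unequal mass it is still defined and reflects the mass difference.
print(l1_rps_distance(peak, half_mass))
```

For equal-mass profiles on unit-spaced bins, the cumulative-sum distance equals the amount of mass moved multiplied by the number of bins it travels, which is the one-dimensional EMD.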
14Data Preprocessing: Profile clustering
To reduce noise, we use the cluster centers of the
input data, rather than individual genes, as the
training data for the BN learning algorithm.
15Super k-means vs. k-means / Cluster 3.0
We propose the Super k-means algorithm to
perform clustering, which yields tighter clusters
than standard k-means, including the
implementation in Cluster 3.0.
Good clustering quality is necessary for a good
final BN learning result.
16The consensus PDAG network with feedbacks
Human Embryonic Stem Cells
We relax the acyclic constraint and perform an
additional structure search after BN learning to
find potential feedback edges (as in learning a
dependency network), since feedback loops are
important and ubiquitous in biology.
17Perfect ROC in Cross Validation
18ROC of alternative approaches
19Alternative clustering approaches for
preprocessing
Cluster 3.0
Affinity Propagation
20Alternative Kernels for BN learning
21CD4 T Cell network
22Mouse ESC network
23The proposed hub role of H3K4me3 in ESCs
24Functional Dissection of Regulatory Models Using
Gene Expression Data of Deletion Mutants
J Li, Y Liu et al., PLoS Genetics (2013)
25Gene Expression Data of Deletion Mutants
In this table, each column represents a deletion
mutant strain, and each row records the
expression change of a specific gene: 1
means up-regulation, -1 means down-regulation
and 0 means no significant change.
26Inferring Genetic Regulatory Networks
Our goal is to infer a genetic regulatory network
among the deletion-mutant genes. However,
traditional Bayesian network learning approaches
failed. Why? Because the dominant value
in the deletion-mutant gene expression data set
is 0, which occurs orders of magnitude more often
than the 1 and -1 values. With traditional
BN-learning metrics, such as K2, BDeu and BIC/MDL,
the huge intra-similarities between 0s
overwhelm the true regulatory signals.
27The DM_BN Kernel
To overcome this problem, we resort to
kernel-based BN learning and propose the DM_BN
kernel. The key insight is to block the
intra-similarities between 0s.
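The idea of blocking the similarity between 0s can be sketched on toy data. The slides do not give the definition of the DM_BN kernel, so `zero_blocking_kernel` below is a hypothetical stand-in that only illustrates the stated insight. It is compared with a plain delta kernel on made-up columns of a deletion-mutant table.

```python
def delta_kernel(x, y):
    """Plain discrete kernel: 1 when the two values agree, else 0."""
    return 1.0 if x == y else 0.0


def zero_blocking_kernel(x, y):
    """Hypothetical kernel: agreement counts only for non-zero values.

    This is NOT the published DM_BN kernel; it only mimics the idea of
    removing the 0-0 similarity.
    """
    return 1.0 if (x == y and x != 0) else 0.0


def column_similarity(kernel, col_a, col_b):
    """Sum the kernel over genes (rows) for two mutant columns."""
    return sum(kernel(x, y) for x, y in zip(col_a, col_b))


# Toy columns over 20 genes: mostly 0, with a few +1 / -1 responses.
mutant_a = [1, -1] + [0] * 18
mutant_b = [1, -1] + [0] * 18        # same responding genes as mutant_a
mutant_c = [0, 0, 1, -1] + [0] * 16  # different responding genes

for name, kernel in [("delta", delta_kernel), ("zero-blocking", zero_blocking_kernel)]:
    related = column_similarity(kernel, mutant_a, mutant_b)
    unrelated = column_similarity(kernel, mutant_a, mutant_c)
    print(name, related, unrelated)
```

With the delta kernel the related and unrelated pairs score 20 and 16, so the shared 0s dominate. With the zero-blocking variant they score 2 and 0, so only the shared non-zero responses contribute.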
28Incorporating a priori causal information
We also use a template matrix to incorporate the
a priori knowledge from deletion-mutant
experiments into BN learning. If Gene B is in
the (influence) target list of Gene A, but not
vice versa, we set (i, j) = 1 and (j, i) = 0
in the template matrix to prohibit the
appearance of B->A in the BN. In this way, the
template matrix constrains the set of
plausible edges in a DAG. Finally, to convert a
DAG to a PDAG after BN learning, we must resort
to Meek's rules [Meek, 1995] to judge the
reversibility of each edge, rather than
Chickering's algorithm [Chickering, 1995].
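A minimal sketch of how such a template matrix could be built from target lists follows. It assumes that i and j are the indices of Gene A and Gene B, that an entry of 1 means the edge i->j is allowed, and that pairs without one-way evidence stay allowed in both directions. The slide does not confirm these conventions, and the function name and toy gene names are hypothetical.

```python
def build_template(genes, targets):
    """Build an edge-permission matrix from deletion-mutant target lists.

    genes:   ordered list of gene names (the BN nodes).
    targets: dict mapping a deleted gene to the set of genes it affects.
    Returns template[i][j] == 1 if the edge i -> j is allowed.
    """
    index = {gene: k for k, gene in enumerate(genes)}
    n = len(genes)
    # Assumption: every edge except a self-loop is allowed by default.
    template = [[0 if r == c else 1 for c in range(n)] for r in range(n)]
    for a in genes:
        for b in targets.get(a, ()):
            if b not in index or b == a:
                continue
            # B is a target of A but A is not a target of B:
            # keep A -> B and prohibit B -> A.
            if a not in targets.get(b, ()):
                i, j = index[a], index[b]
                template[i][j] = 1
                template[j][i] = 0
    return template


genes = ["A", "B", "C"]
targets = {"A": {"B"}, "B": set(), "C": {"A"}}
template = build_template(genes, targets)
for row in template:
    print(row)
```

A structure search would then skip any candidate edge i->j whose template entry is 0.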
29High quality of the networks inferred by DM_BN
30Correctness of edge directions with/without
using templates
Without using the template matrix, the DM_BN
kernel reaches 80% accuracy in the de novo
inference of edge directions, which is
significantly better than random guessing.
31The inferred Yeast regulatory network
Online acyclicity checking is implemented to
enable learning large networks.
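One common way to check acyclicity online is to test, before each edge insertion, whether the new edge would close a directed cycle. The sketch below shows that idea with a reachability search. It is an assumption about what such a check could look like, not the implementation used in this work, and the class and method names are hypothetical.

```python
class IncrementalDAG:
    """Directed graph that refuses edges which would create a cycle."""

    def __init__(self):
        self.children = {}  # node -> set of child nodes

    def _reaches(self, start, goal):
        """Iterative depth-first search: is goal reachable from start?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.children.get(node, ()))
        return False

    def try_add_edge(self, u, v):
        """Add u -> v unless it closes a cycle; report whether it was added."""
        # u -> v creates a cycle exactly when u is already reachable from v.
        if u == v or self._reaches(v, u):
            return False
        self.children.setdefault(u, set()).add(v)
        self.children.setdefault(v, set())
        return True


dag = IncrementalDAG()
print(dag.try_add_edge("A", "B"))  # accepted
print(dag.try_add_edge("B", "C"))  # accepted
print(dag.try_add_edge("C", "A"))  # rejected: would close A -> B -> C -> A
```

Each candidate edge is checked against the current graph only, so the whole network never has to be re-validated after every search move.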
32Integrating Genomic, Epigenomic, and
Transcriptomic Features Reveals Modular
Signatures Underlying Poor Prognosis in Ovarian
Cancer
W Zhang, Y Liu et al., Cell Reports (2013)
Thanks to Dr. Wei Zhang for contributing to the
slides on this topic.
33The Cancer Genome Atlas (TCGA)
http://cancergenome.nih.gov/
34Summary of the Ovarian cancer data in TCGA
35Summary of the Ovarian cancer data in TCGA
- The copy number segmentation data were mapped to
the positions of genes and miRNAs.
- Normalization
- Value_norm = (Value_raw - Median_controls) /
STD_patients
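The normalization can be sketched for a single feature as follows. It assumes the formula subtracts the median of the control samples and divides by the standard deviation across patients, and it uses the sample (n-1) standard deviation, which the slide does not specify. The function name and toy values are hypothetical.

```python
from statistics import median, stdev


def normalize_feature(patient_values, control_values):
    """Center on the control median and scale by the patient spread."""
    center = median(control_values)
    scale = stdev(patient_values)  # assumption: sample standard deviation
    if scale == 0:
        raise ValueError("feature is constant across patients")
    return [(value - center) / scale for value in patient_values]


# Toy measurements of one feature (not real TCGA values).
patients = [2.0, 4.0, 6.0]
controls = [1.0, 3.0, 5.0]
print(normalize_feature(patients, controls))
```

Centering on the controls makes 0 correspond to the typical normal level, while scaling by the patient spread puts different feature types on a comparable scale.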
36Scientific Questions
By combining the clinical and heterogeneous
high-throughput data, can we discover ovarian
cancer subtypes with different outcomes?
Can we find active regulatory pathways in these
subtypes that could explain their different
prognoses?
37Selecting the Ovarian Cancer Hazard Factors
To investigate which features are related to the
prognosis of ovarian cancer, we first used the Cox
proportional hazards model to perform a
regression analysis between each feature and
the patients' survival time. In total we
selected 4,526 features as hazard factors (P <
0.05), including expression changes of 1,651
genes, promoter DNA methylation changes of 455
genes, expression changes of 140 miRNAs, and the
CNAs of 2,191 genes and 89 miRNAs.
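The per-feature screening step can be sketched as a univariate Cox regression fitted by Newton-Raphson. The slide does not say how ties were handled or which test produced the P values, so this version assumes the Breslow partial likelihood and a two-sided Wald test. The function name and the toy survival data are hypothetical.

```python
import math


def cox_univariate(times, events, x, max_iter=25, tol=1e-9):
    """Fit h(t|x) = h0(t) * exp(beta * x); return (beta, Wald p-value).

    times:  follow-up time per patient.
    events: 1 if death was observed, 0 if censored.
    x:      value of one feature per patient.
    """
    beta = 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue  # censored patients contribute only to risk sets
            s0 = s1 = s2 = 0.0
            for t_j, x_j in zip(times, x):
                if t_j >= t_i:  # patient j is still at risk at time t_i
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
                    s2 += w * x_j * x_j
            mean = s1 / s0
            score += x_i - mean           # gradient of the log partial likelihood
            info += s2 / s0 - mean * mean  # observed information
        if info <= 0:
            raise ValueError("feature has no variation within the risk sets")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    z = beta * math.sqrt(info)            # Wald statistic beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, p_value


# Toy cohort (not real data): 8 patients, all deaths observed.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
feature = [1, 0, 1, 1, 0, 1, 0, 0]
beta, p_value = cox_univariate(times, events, feature)
print(beta, p_value, p_value < 0.05)
```

In a screening setting, this function would be applied to every feature in turn and the features with P < 0.05 retained. A dedicated survival analysis library would normally be preferred in practice.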
38De novo discovery of ovarian cancer subtypes by
adaptive clustering
39Signatures of the 7 subtypes of Ovarian Cancer
These signatures were identified using the
Wilcoxon rank-sum test.
40Enriched terms of subtype 2-specific
up-regulated genes
These terms, such as cell adhesion, TGF-beta
binding, angiogenesis and positive regulation of
cell proliferation, are related to tumorigenesis
and metastasis.
41Comparing the survival curves between subtype 2
and stage IV patients
The 5-year survival rate of subtype 2 patients was
even lower than that of stage IV patients.
42The cancer knowledge base
Pathways in cancer | Telomere maintenance | Inflammatory response
MAPK signaling pathway | VEGF signaling pathway | Glycolysis / Gluconeogenesis
mTOR signaling pathway | Wnt signaling pathway | T cell receptor signaling pathway
ErbB signaling pathway | ECM-receptor interaction | B cell receptor signaling pathway
Jak-STAT signaling pathway | Adherens junction | Natural killer cell mediated cytotoxicity
Cytokine-cytokine receptor interaction | Focal adhesion |
Cell cycle | p53 signaling pathway |
PPAR signaling pathway | Base excision repair |
TGF-beta signaling pathway | Mismatch repair |
Apoptosis | Nucleotide excision repair |
The hallmarks of cancer
Hanahan & Weinberg, 2011
Used to filter out signature genes that are not
drivers of cancer.
43The interaction network of signature genes
44THANKS