Title: Analysis and Integration of Large-scale Molecular and Clinical Data in Cancers
1. Analysis and Integration of Large-scale Molecular and Clinical Data in Cancers
- Sampsa Hautaniemi, DTech
- Systems Biology Laboratory
- Institute of Biomedicine
- Genome-Scale Biology Research Program
- Centre of Excellence in Cancer Genetics
- Faculty of Medicine
- University of Helsinki
2. Table of Contents
- The essence of systems biology: Iteration and collaboration.
  - Iteration in ovarian cancer.
- The essence of systems biology II: Multi-level data.
  - The multi-level nature of breast cancer.
- The essence of systems biology III: Computation.
  - Anduril computational framework, glioblastoma multiforme.
3. Systems Biology: Iteration
Adapted from a slide by Peter Sorger
4. Ovarian Cancer
- Epithelial ovarian cancer is the fifth most frequent cause of female cancer deaths, with an overall 5-year survival rate below 50%.
- The standard chemotherapy for high-grade serous ovarian cancer (HGS-OvCa) is a platinum-taxane combination.
- The majority of patients suffer relapse in <18 months.
- There are no clinically applicable methods to predict the prognostic outcome or even to identify the patients unresponsive to current therapies.
5. Aims of the HGS-OvCa Study
- To identify poor-response and good-response subtypes of HGS-OvCa.
- To report biomarkers that identify whether an HGS-OvCa patient responds to platinum treatment.
- We developed a computational method that integrates transcriptomics and clinical data in the subtype-finding step.
- We used transcriptomics and clinical data from 184 HGS-OvCa patients treated with platinum and taxane, obtained from the TCGA repository.
6. Three Subtypes of HGS-OvCa
Chen et al. In preparation.
7. Validation, validation, validation
- We also used an independent prospective HGS-OvCa cohort of 29 patients.
- Data were measured with qRT-PCR.
Chen et al. In preparation.
8. Pathway Analysis
- Our pathway analysis also identified TR3 as a potential driver of platinum resistance.
9. TR3 Inhibition with Two Drugs
- We identified two signaling pathway regulators for TR3 and associated inhibitors.
- The use of the two inhibitors should render the HGS-OvCa cells sensitive to platinum.
[Figure panels: AKT inhibitor; AKT inhibitor and ERK5 inhibitor]
Chen et al. In preparation.
10. Systems Biology II: Multi-level Data
- While cancer cells are clearly visible, the exact molecular causes of cancer are still unknown.
- We need to study cancer samples at multiple levels.
11. Multiple Levels of Data
100 samples lead to 200 million data points.
12. Multi-level Data: Estrogen Receptor
13. Why Is This Important?
- Estrogen receptor is the most important clinical variable in determining how to treat a breast cancer patient.
- There are several anti-cancer drugs targeting the estrogen receptor pathway.
- It is currently unknown which tumors do not respond to therapy.
- Finding genes that respond to estrogen receptor stimulus may give clues as to which genes are important in ER inhibition resistance.
Hugo Simberg, Garden of Death
14. Data
- We used chromatin immunoprecipitation combined with massively parallel sequencing (ChIP-seq) to determine genome-wide occupancy (eight time points) after estradiol stimulus in the MCF-7 breast cancer cell line for:
  - Estrogen receptor α
  - RNA polymerase II
  - Histone marks (H3K4me3, H2A.Z)
- These experiments resulted in >2.0 billion data points for the initial analysis.
15. SYNERGY database
- The SYNERGY database is available and fully operational.
- http://csblsynergy.fimm.fi/
16. Finding ER-Responsive Genes
17. Results
- We identified 777 estrogen receptor early-responding genes.
- Interestingly, the major estrogen receptor related changes in cells were due to non-genomic action.
18. Results
- Next we searched for genes that have a survival association in a breast cancer cohort of 150 ER/HER2-/postmenopausal patients in The Cancer Genome Atlas (TCGA).
- Based on Kaplan-Meier analysis we identified 23 genes with survival p < 0.05.
- The gene most strongly associated with survival was ATAD3B.
19. Kaplan-Meier for ATAD3B
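A Kaplan-Meier curve is built from each patient's follow-up time and an indicator of whether the event (death) was observed or the patient was censored. The sketch below shows the standard product-limit estimator in plain Python. The function name `kaplan_meier` and the toy follow-up values are illustrative assumptions, not the analysis code or patient data behind this slide.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate of the survival function.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    Returns (time, survival probability) at every time with at least one death.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        n_at_risk -= leaving
    return curve


# Toy data (hypothetical): five patients, two of them censored.
toy_curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
```

To compare two groups, for example patients with high versus low expression of a gene, one curve would be computed per group and the curves compared with a significance test.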
20. Intermission
- Pol2 activity is a much better way than mRNA of searching for genes responsive to a cue.
- In deep sequencing, the sequencing depth is important (with our 200 million short-read Pol2 data, we found many ER-responsive genes not found in 20 million short-read GRO-seq).
- How to systematically analyze multi-level data?
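The effect of depth can be illustrated with a simple sampling model. The sketch below assumes that reads land on a gene independently, so the read count is Poisson distributed with mean depth × fraction, and that a gene is "detected" once it collects a minimum number of reads. Only the depths of 20 and 200 million reads come from the slide. The per-gene read fraction, the detection threshold and the function name are hypothetical. The model also ignores the fact that Pol2 ChIP-seq and GRO-seq are different assays, so it isolates the depth effect only.

```python
import math


def detection_probability(depth_reads: float, gene_fraction: float, min_reads: int) -> float:
    """P(at least min_reads reads fall on the gene) under a Poisson model."""
    lam = depth_reads * gene_fraction  # expected number of reads on the gene
    term = math.exp(-lam)              # P(X = 0)
    below_threshold = 0.0
    for i in range(min_reads):
        below_threshold += term        # accumulate P(X = i)
        term *= lam / (i + 1)          # P(X = i + 1) from P(X = i)
    return 1.0 - below_threshold


# Hypothetical weakly covered gene: 5 reads per 10 million sequenced reads.
GENE_FRACTION = 5e-7
MIN_READS = 20  # hypothetical detection threshold

p_shallow = detection_probability(20e6, GENE_FRACTION, MIN_READS)
p_deep = detection_probability(200e6, GENE_FRACTION, MIN_READS)
```

Under these assumptions the gene is almost never detected at the shallower depth and almost always detected at the deeper one, which is the qualitative pattern described above.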
21. Multi-level Cancer Research Requires Computational Methods
- Storing the data and computing power are the first (but relatively small) hurdles.
- Analysis of large-scale, heterogeneous data is much more challenging than single genomics or proteomics data analysis.
- There is a need for computational infrastructure.
- Writing an analysis program quickly without proper infrastructure will lead to delays and errors in larger projects.
22. Infrastructure: Anduril
- Anduril is a computational framework to integrate large-scale and heterogeneous data, knowledge in bio-databases and analysis tools.
- The main design principles are:
  - Modular pipeline analysis approach
  - Scalable
  - Open source, thorough documentation
  - http://www.anduril.org/
- A method written in any programming language that is executable from the command prompt can be included.
- Automatically produces a result PDF and a website containing the results.
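The modular idea, where each step is an independent command-line program and the framework only wires outputs to inputs, can be sketched in a few lines. This is not Anduril's API or its pipeline language. The function `run_pipeline`, the step names and the two one-line "components" are hypothetical stand-ins chosen so that the example runs anywhere Python is installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_pipeline(steps, sources, workdir):
    """Run command-line components in order, wiring outputs to inputs.

    steps   : list of (name, command, input_names); each command is called as
              command + input paths + output path
    sources : dict mapping a name to an existing input file
    Returns a dict mapping every source and step name to its file.
    """
    outputs = dict(sources)
    for name, command, input_names in steps:
        out_path = Path(workdir) / f"{name}.txt"
        in_paths = [str(outputs[n]) for n in input_names]
        # check=True stops the pipeline as soon as one component fails.
        subprocess.run(command + in_paths + [str(out_path)], check=True)
        outputs[name] = out_path
    return outputs


# Two stand-in components; any executable with the same calling
# convention (inputs first, output last) could replace them.
UPPER = [sys.executable, "-c",
         "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"]
COUNT = [sys.executable, "-c",
         "import sys; open(sys.argv[2], 'w').write(str(len(open(sys.argv[1]).read().split())))"]


def demo():
    """Normalise a small gene list, then count its entries."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.txt"
        raw.write_text("tp53 egfr msn")
        steps = [("upper", UPPER, ["raw"]), ("count", COUNT, ["upper"])]
        outputs = run_pipeline(steps, {"raw": raw}, tmp)
        return outputs["upper"].read_text(), outputs["count"].read_text()
```

Because every step talks to the others only through files, a component can be rewritten in another language or rerun in isolation without touching the rest of the pipeline.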
23. Complex Pipelines Are Fragile
24. Glioblastoma Multiforme (GBM)
- Glioblastoma multiforme (GBM) is one of the deadliest cancers.
- The Cancer Genome Atlas (TCGA) has published data from >500 GBM patients:
  - comparative genomic hybridization arrays
  - single nucleotide polymorphism arrays
  - exon and gene expression arrays
  - microRNA arrays
  - methylation arrays
  - clinical data
- Which genes or genetic regions have a survival effect?
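One common way to screen a gene for a survival effect is to split patients at the median of a measurement, such as expression, and compare the two groups with a log-rank test. The slides do not state which test or grouping was used, so the sketch below is an assumption. The function names and the toy values are hypothetical as well.

```python
import math
import statistics


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p-value) with 1 d.o.f."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e, _ in pooled if e}):
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)  # at risk, group A
        n_b = sum(1 for u, _, g in pooled if u >= t and g == 1)  # at risk, group B
        d_a = sum(1 for u, e, g in pooled if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in pooled if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # deaths expected in A if the groups do not differ
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def survival_effect(values, times, events):
    """Split patients at the median of `values` and compare survival."""
    cutoff = statistics.median(values)
    high = [(t, e) for v, t, e in zip(values, times, events) if v > cutoff]
    low = [(t, e) for v, t, e in zip(values, times, events) if v <= cutoff]
    return logrank_test([t for t, _ in high], [e for _, e in high],
                        [t for t, _ in low], [e for _, e in low])


# Toy data (hypothetical): higher values go with shorter survival.
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_times = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
toy_events = [1] * 10
toy_chi2, toy_p = survival_effect(toy_values, toy_times, toy_events)
```

Running such a test for every gene or region on every platform produces thousands of p-values, so a genome-wide screen would also need a multiple-testing correction.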
25. GBM Results in Anduril Website
26. Latest on moesin in GBM
27. (Sequence) Component Libraries
- Over 400 Anduril components are already available.
- Pipelines:
  - ChIP-seq (EMBO J 2011, Cancer Res 2012, ...)
  - RNA-seq (not published)
  - miRNA-seq (not published)
  - DNA methylation-seq (not published)
  - Whole-genome sequence, exome-sequence (not published)
  - Image analysis (manuscript)
28. Summary
- Characterization of a complex disease first requires identifying the key variables.
- This requires integration of data from multiple levels, an iterative mode of research and collaboration.
- Multi-level data integration requires computational infrastructure and data-intensive computing.
- We have developed Anduril to organize large-scale data analysis projects (imaging, deep sequencing, database usage, conversions, etc.).
- The need for computational infrastructure is evident in particular when analyzing deep sequencing data.
- All our methods are (will be) freely available.
http://research.med.helsinki.fi/gsb/hautaniemi/software.html
29. Acknowledgements
Systems Biology Lab
Funding: Academy of Finland, Finnish Cancer Organizations, Sigrid Jusélius Foundation, EU FP7 ERA-NET SysBio, Biocenter Finland, Biocentrum Helsinki
Collaborators: Olli Carpén, Henk Stunnenberg, George Reid, Jukka Westermarck