Title: Introduction to
1Introduction to Gene expression profiling
Christine Desmedt, PhD
Translational Research Unit, Université Libre de Bruxelles, Institut Jules Bordet
Brussels, Belgium
2- Introduction
- The technique
- Standardization
- Bio-informatics
- Use of Gene expression profiling in breast cancer
3Introduction
4From genes to proteins
DNA (replication) -> transcription -> RNA -> translation -> Protein
RNA level: gene expression; protein level: protein expression
DNA level: SNPs, mutations, amplifications, deletions, gene fusions, chromosomal aberrations
5Sequence of the GENOME
Progress in technology and Bio-informatics
Era of molecular medicine
Gene expression profiles
6Microarray experiments
- Advantages
- Measure expression of several thousands of genes simultaneously
- Possibility to discover important pathways and genes relevant to the clinical problem
- Examine a snapshot of gene expression of tumor and environment, rather than genes individually
- Disadvantages
- Huge volume of data produced: chance of false discoveries ("fishing expeditions")
- How to interpret and generate useful biological information remains a significant challenge
- Clinical validation and utility of findings still remain unclear
7The technique
8Microarray platforms
- Probe implementation
- Full-length cDNA
- Oligonucleotides (presynthesized, in situ synthesized)
- Target-labeling strategies
- Single-color detection
- Dual-color detection
9Probe implementation
cDNA
Oligonucleotide
Short (Affymetrix)
Long (Agilent)
10Full-length cDNA: printing in house
384 well Plate
slide
Platter 100 slides
11Printing head and pin
12Short oligonucleotide array Affymetrix
13Short oligonucleotide array Affymetrix
Fully integrated instrument that maximizes data
reproducibility and laboratory productivity by
minimizing user intervention
14Signal detection: DUAL color detection
15Normal
Wavelength 635
Tumor
Wavelength 532
Composite image
16Normal tissue: wavelength 635 nm (spots 1, 2, 3)
Tumor: wavelength 532 nm (spots 1, 2, 3)
Ratio (635 nm/532 nm)
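The per-spot ratio on this slide is usually reported on a log2 scale, so that equal up- and down-regulation are symmetric around zero. A minimal sketch, assuming background-subtracted intensities; the function name, the `floor` guard and the three example spots are hypothetical and not taken from the slides:

```python
import math

def log2_ratios(channel_635, channel_532, floor=1.0):
    """Return log2(635 nm / 532 nm) for each spot.

    Intensities are floored (assumed guard value) so that a zero or
    negative background-subtracted signal cannot break the logarithm.
    """
    ratios = []
    for red, green in zip(channel_635, channel_532):
        red = max(red, floor)
        green = max(green, floor)
        ratios.append(math.log2(red / green))
    return ratios

# Hypothetical intensities for spots 1, 2, 3
normal_635 = [1200.0, 400.0, 800.0]
tumor_532 = [300.0, 400.0, 3200.0]
ratios = log2_ratios(normal_635, tumor_532)
print(ratios)
```

A positive value means a stronger 635 nm (normal) signal, a negative value a stronger 532 nm (tumor) signal, and zero means equal signal in both channels.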
17Image Analysis
18Results
19Signal detection: SINGLE color detection (Affymetrix)
Streptavidin-phycoerythrin (SAPE), counterstained with biotinylated anti-streptavidin
20Signal detection: SINGLE color detection (Affymetrix)
47,000 transcripts!
21Signal detection: SINGLE color detection (Affymetrix)
22Signal detection: SINGLE color detection (Affymetrix)
23Standardization
24Standardization (1)
- MIAME guidelines
- The Minimum Information About a Microarray Experiment (MIAME) checklist helps define the level of detail that should exist and is being adopted by many journals as a requirement for the submission of papers incorporating microarray results.
25Standardization (2)
MIAME (Minimum Information About a Microarray Experiment): http://www.mged.org/Workgroups/MIAME/miame.html
26Short oligonucleotide array Affymetrix
Standardization (2) public repositories
- Gene Expression Omnibus (GEO)
- (http://www.ncbi.nlm.nih.gov/geo/)
- A gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval.
- ArrayExpress
- (http://www.ebi.ac.uk/arrayexpress/)
- ArrayExpress is a public repository for microarray data, which is aimed at storing MIAME-compliant data in accordance with recommendations. The ArrayExpress Data Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the repository.
27Short oligonucleotide array Affymetrix
Standardization (2) public repositories
28Short oligonucleotide array Affymetrix
Standardization (2) public repositories
29Standardization MIAME
- The raw data for each hybridisation (e.g., CEL or GPR files)
- The final processed (normalised) data for the set of hybridisations in the experiment (study) (e.g., the gene expression data matrix used to draw the conclusions from the study)
- The essential sample annotation including experimental factors and their values (e.g., compound and dose in a dose response experiment)
- The experimental design including sample data relationships (e.g., which raw data file relates to which sample, which hybridisations are technical, which are biological replicates)
- Sufficient annotation of the array (e.g., gene identifiers, genomic coordinates, probe oligonucleotide sequences or reference commercial array catalog number)
- The essential laboratory and data processing protocols (e.g., what normalisation method has been used to obtain the final processed data)
30Standardization (3) MAQC
- MAQC initiative
- The MicroArray Quality Control (MAQC) Project is being conducted by the FDA to develop standards and quality control metrics which will eventually allow the use of microarray data in drug discovery, clinical practice and regulatory decision-making.
31- 4 mRNA samples
- 5 replicates
- 6 microarray platforms
- 3 laboratories
Nature Biotechnology, 24, 9, 1151-1161, 2006
32Coefficient of variation
- 4 mRNA samples
- 5 replicates
- 6 microarray platforms
- 3 laboratories
CV: 5-15% within laboratories, 10-20% between laboratories
Nature Biotechnology, 24, 9, 1151-1161, 2006
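For reference, the coefficient of variation is the standard deviation of replicate measurements divided by their mean, expressed as a percentage. A minimal sketch; the five replicate signal values are hypothetical and are not MAQC data:

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    return 100.0 * statistics.stdev(values) / mean

# Hypothetical signal of one probe across 5 replicate arrays
replicates = [100.0, 110.0, 90.0, 105.0, 95.0]
cv = coefficient_of_variation(replicates)
print(f"CV = {cv:.1f}%")
```

With these made-up values the CV falls inside the within-laboratory range quoted on the slide.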
33MAQC Findings
Microarray data are
- Repeatable within a laboratory
- Reproducible across laboratories
- Concordant across platforms
- Comparable with quantitative technologies, e.g., QPCR
- Reflective of biology regardless of the differences in technology
if we look at differential gene expression in terms of fold-change (FC) ranking.
34Two Phases of the MAQC Project, Addressing Two
Types of Microarray Applications
I. Class Comparison: What makes the two populations different?
Differentially Expressed Genes (DEGs)
MAQC-I
Better understanding of the biological mechanisms
II. Class Prediction: Can the outcome of new individuals be predicted?
Predictive Models (Classifiers)
MAQC-II
Diagnosis, treatment outcome, prognosis, personalized medicine
35Bio-informatics and examples
36Collection, transformation and representation of
the data
Raw data (single, dual-color)
Background correction, data transformation, normalization (differences in labeling, hybridization and detection methods)
Filtering (elimination of genes with minimal variance)
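The chain above can be sketched on a toy data set. This is only an illustration under stated assumptions: background is subtracted spot by spot, the transformation is log2, normalization is median-centering of each array, and the variance cut-off of 0.1 is arbitrary. The function name, gene names and intensities are hypothetical.

```python
import math
import statistics

def preprocess(raw, background, min_variance=0.1, floor=1.0):
    """raw, background: dict mapping gene -> list of intensities (one per array)."""
    genes = list(raw)
    n_arrays = len(raw[genes[0]])
    # 1. Background correction followed by log2 transformation
    logged = {
        g: [math.log2(max(raw[g][j] - background[g][j], floor))
            for j in range(n_arrays)]
        for g in genes
    }
    # 2. Normalization: subtract each array's median log intensity
    medians = [statistics.median(logged[g][j] for g in genes)
               for j in range(n_arrays)]
    normalized = {g: [logged[g][j] - medians[j] for j in range(n_arrays)]
                  for g in genes}
    # 3. Filtering: drop genes whose variance across arrays is minimal
    return {g: v for g, v in normalized.items()
            if statistics.variance(v) >= min_variance}

# Hypothetical raw intensities for 3 genes on 3 arrays, constant background of 8
raw = {
    "geneA": [72.0, 264.0, 1032.0],
    "geneB": [136.0, 136.0, 136.0],
    "geneC": [264.0, 72.0, 24.0],
}
background = {g: [8.0, 8.0, 8.0] for g in raw}
matrix = preprocess(raw, background)
print(matrix)
```

The result is a small expression matrix (genes by arrays); geneB is removed because it does not vary across arrays.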
37Development of an expression matrix
38Unsupervised analyses: discover classes of tumors/specimens or genes
Supervised analyses: discover genes associated with phenotype and build a prediction model
39Unsupervised analyses
Discover classes of tumors/specimens or genes
- Cluster analysis algorithms
- Hierarchical
- K-means
- Self-Organizing Maps
- Multitude of others
40Unsupervised analysis: clustering
E.g., we measure the expression of 3 genes for a set of patients; samples are displayed in this gene space
Axis: gene expression level
41Unsupervised analysis: clustering
Patients can be grouped in three different clusters
Axis: gene expression level
42Unsupervised analysis: clustering
- Widely used (Hartigan, 1975; Eisen et al., 1998)
- Organizing objects in a hierarchical tree (dendrogram) based on their degree of dissimilarity
- Linkage: distance between two clusters of objects
- Assess quality: stability and robustness
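To make the idea of linkage concrete, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering. It assumes Euclidean distance as the dissimilarity and average linkage between clusters; other choices are equally common. The six samples in a 3-gene space are hypothetical, and the naive search over all cluster pairs is only suitable for toy data.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters and the list of merges with their heights,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a] + clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters], merges

# Hypothetical expression of 3 genes in 6 samples
samples = [
    (1.0, 1.0, 1.0), (1.2, 0.9, 1.1),
    (5.0, 5.0, 1.0), (5.1, 4.8, 1.2),
    (1.0, 5.0, 5.0), (0.9, 5.2, 4.9),
]
clusters, merges = agglomerate(samples, n_clusters=3)
print(clusters)
```

Stability can then be probed by repeating the clustering with another linkage or on resampled data and checking whether the same groups reappear.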
43Unsupervised analysis: EXAMPLE
- 65 human breast cancer (BC) samples from 42 individuals
- 20 patients had profiles before and after chemotherapy (CT)
- Unsupervised hierarchical clustering method was used to group genes on the basis of similarity of patterns of gene expression
- Genes are ranked vertically and samples horizontally
- Found that ER status was a major discriminator of subtypes
- Breast tumors show great variation in gene expression
- Gene expression is multi-dimensional, i.e., many different gene sets are differently expressed
- Tumor samples from the same patient clustered together
Perou et al. 2000
44Supervised analysis
- Class Comparison
- To compare the gene expression profiles of 2 or more groups of patients
- Statistical test
- Binary class: t-test, Wilcoxon rank sum test
- More than 2 classes: ANOVA, Kruskal-Wallis test
- Significance: p-value
- Multiple testing
- Many hypotheses are tested simultaneously
- Example: 10,000 genes and p-value < 0.05: about 500 false positives expected by chance
- Correction needed
- Problems with traditional methods (e.g., Bonferroni)
- Most assume variable independence
- Many are considered too stringent
- Trade-off between biological information and false positives
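The arithmetic of the example, and the stringency trade-off, can be sketched as follows. Bonferroni is the correction named on the slide; the Benjamini-Hochberg false discovery rate procedure is added here only as an assumed, commonly used, less stringent alternative. The ten p-values are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of p-values below alpha when no gene truly differs."""
    return n_tests * alpha

def bonferroni(p_values, alpha=0.05):
    """Reject only if p <= alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k being the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

print(expected_false_positives(10_000, 0.05))

# Hypothetical p-values for 10 genes
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni(p))
n_bh = sum(benjamini_hochberg(p))
print(n_bonferroni, n_bh)
```

On this toy list, five genes pass an uncorrected p < 0.05, while the two corrections keep fewer of them, Bonferroni being the stricter one.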
45Supervised analysis
The apparent lack of reproducibility in identifying differentially expressed genes across different platforms and sites results from P-value ranking only!
FC-ranking should be used in combination with a nonstringent P threshold to select a DEG list that is reproducible, specific, and sensitive, and a joint rule is recommended as a baseline practice.
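A joint rule of this kind can be sketched as a filter followed by a ranking. The cut-offs used here (p < 0.05 and at least a 2-fold change) and the gene list are hypothetical placeholders, not values prescribed by the slides:

```python
import math

def select_degs(genes, fc_cutoff=2.0, p_cutoff=0.05):
    """genes: list of (name, log2_fold_change, p_value) tuples.

    Keep genes passing a nonstringent P threshold and a fold-change
    cut-off, then rank them by absolute fold change.
    """
    min_log2_fc = math.log2(fc_cutoff)
    passing = [g for g in genes
               if g[2] < p_cutoff and abs(g[1]) >= min_log2_fc]
    return sorted(passing, key=lambda g: abs(g[1]), reverse=True)

# Hypothetical results of a class comparison
results = [
    ("A", 3.0, 0.010),
    ("B", 0.2, 0.0001),   # very small p-value but negligible fold change
    ("C", -2.0, 0.030),
    ("D", 2.5, 0.200),    # large fold change but fails the P threshold
    ("E", 1.0, 0.040),
]
deg_names = [name for name, _, _ in select_degs(results)]
print(deg_names)
```

Gene B would top a list ranked by P-value alone, but is excluded here because its fold change is negligible.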
46Supervised analysis
Class prediction: to create a multi-gene predictor
Split data set randomly into training set and test set
Training set
1. Identify discriminating genes between two groups of interest
2. Construct classifier by combining genes with predictive machine learning algorithms
3. Estimate classification error rate in leave-one-out cross validation by repeating steps 1-2
4. Select best classifier
5. Test significance in permutation test
Test set
6. Validate predictor accuracy on independent data
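The leave-one-out step can be sketched as below. The important point is that gene selection is repeated inside every fold, so the left-out sample never influences which genes are chosen. The nearest-centroid classifier and the mean-difference gene score are assumptions chosen for brevity (a t-test and other learning algorithms are common alternatives); the data are hypothetical, and the permutation test and independent validation are not shown.

```python
import statistics

def select_genes(X, y, n_genes):
    """Rank genes by absolute difference between the two class means."""
    scores = []
    for j in range(len(X[0])):
        m0 = statistics.mean(x[j] for x, label in zip(X, y) if label == 0)
        m1 = statistics.mean(x[j] for x, label in zip(X, y) if label == 1)
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)[:n_genes]]

def train_centroids(X, y, genes):
    """Mean expression of the selected genes in each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [statistics.mean(r[j] for r in rows) for j in genes]
    return centroids

def predict(x, centroids, genes):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

def loocv_error(X, y, n_genes):
    """Leave-one-out error, repeating gene selection and training in each fold."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, n_genes)        # step 1
        centroids = train_centroids(X_train, y_train, genes)   # step 2
        if predict(X[i], centroids, genes) != y[i]:
            errors += 1
    return errors / len(X)

# Hypothetical training set: 6 samples, 3 genes, two outcome groups
X = [[1.0, 5.0, 3.0], [1.2, 4.0, 3.1], [0.9, 6.0, 2.9],
     [3.0, 5.1, 3.0], [3.2, 4.2, 3.2], [2.9, 5.9, 2.8]]
y = [0, 0, 0, 1, 1, 1]
error_rate = loocv_error(X, y, n_genes=1)
print(error_rate)
```

Selecting genes once on the full training set and cross-validating only the classifier would give an optimistically biased error estimate.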
47Supervised analysis: EXAMPLE
70-gene signature
- Found 231 genes correlated to distant metastasis (DM)
- Ranked in order of significance (p value)
- 70 genes were chosen
- Validation in 19 patients, 17/19 correctly predicted
- Established clinical utility: could outperform St Gallen and NCI criteria by predicting who needed chemotherapy (CT), i.e., who relapsed, and who did not
78 tumors: good signature vs. poor signature
van 't Veer et al., Nature, 2002
Validation series: n = 151 node-negative patients (+ 144 node-positive patients), van de Vijver et al., NEJM, 2002
48Supervised analysis: EXAMPLES
META-ANALYSIS of PUBLICLY AVAILABLE DATA
DIFFERENT PROGNOSTIC GENE SIGNATURES
- Proliferation is the common denominator of the different signatures
- These signatures perform well mainly in ER+ patients
- Immune response and tumor invasion may differentiate tumors with better and worse prognosis in ER- and HER2+ patients
MammaPrint
Genomic Grade
76-gene signature
Oncotype
Wound signature
49Independent validation studies
- Role: to confirm the results of a previous study, in order to reduce the play of chance and the potential for bias
- Ransohoff, Nat Rev Cancer 2004, 2005
- Common mistakes
- Including part of the initial sample of patients
- Including other types of patients
- Using another measurement technique (e.g., RT-PCR)
- Changing the prediction rule to adapt it to the new set ("playing with data"): different algorithm, different cut-off, changing genes, etc.
50JNCI 2007
Critical review of 90 outcome-related
statistical analyses of microarray studies
published between 2000 and 2004
Development of a check-list
1. Need for clear objectives; study objectives should influence patient selection!
2. Class-discovery methods are not really suited for outcome-related analyses; they are more suited to, for example, elucidating pathways.
3. Class comparison analyses are appropriate when outcomes are discrete. If the outcome is survival, we lose information by making discrete groups.
4. Data used for developing a predictor should be distinct from data used to validate it.
51Some easy tools
BRB-ArrayTools: developed by Richard Simon and the BRB-ArrayTools Development Team
52Molecular classification
Tumor
microenvironment
CTCs
Prediction of treatment efficacy
Understanding the biology
53Questions are welcome!
54Back up
55Better understanding of Tumor Biology
56Understanding biological mechanisms
- Extracting biological insight from microarray data remains a major challenge
- Often long gene lists are produced, and these genes change with different datasets
- After correcting for multiple testing, no significant genes may be found
- Single gene analysis may miss important pathway effects
- Often highly dependent on the laboratory's/supervisor's area of expertise
57Gene set enrichment analysis
- Gene sets used rather than single genes
- A modest but coordinated fold change across all genes in a gene set can be significant, rather than a dramatic fold change in a single gene
- Cellular processes often affect sets of genes
Genes are ranked based on correlation with a phenotype. An Enrichment Score is calculated which reflects the degree to which the gene set is overrepresented at the extremes of the ranked list.
Mootha V et al, Nat Gen 2003; Subramanian A et al, PNAS 2005; Segal et al, Nat Gen 2004
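The enrichment score can be illustrated with a running-sum statistic walked down the ranked gene list. This is a minimal sketch of the unweighted (Kolmogorov-Smirnov-like) form only, which is an assumption made for simplicity; the published methods add correlation weighting, permutation-based significance and normalization, none of which is shown. Gene names and gene sets are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic.

    Walk down the ranked list, stepping up for genes in the set and down
    for genes outside it; return the largest deviation from zero.
    """
    gene_set = set(gene_set)
    n_hits = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    hit_step = 1.0 / n_hits
    miss_step = 1.0 / n_miss
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit_step if gene in gene_set else -miss_step
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical list of 8 genes ranked by correlation with the phenotype
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})
es_bottom = enrichment_score(ranked, {"g6", "g7", "g8"})
es_spread = enrichment_score(ranked, {"g1", "g4", "g8"})
print(es_top, es_bottom, es_spread)
```

Sets concentrated at the top or the bottom of the ranking give scores near +1 or -1, while a set spread through the list gives a score of smaller magnitude.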
58Ingenuity Pathways Analysis
- View gene lists within framework of functional networks
- Protein-protein interactions curated from the literature
- Generate hypotheses for experimental validation
www.ingenuity.com
Top pathway: insulin receptor signaling
59Connectivity Map
- Database of gene expression profiles of common cell lines treated with drugs
- Multiple batches, different doses
- Connect gene signatures with signatures of drug response
http://www.broad.mit.edu/cmap/
Lamb J et al, Science 2006
60MCF7 cells from a publicly available dataset were treated with estradiol and profiled. Their profiles were highly correlated with those in the Connectivity Map of MCF7 cells treated with estradiol, and negatively correlated with those of cells treated with anti-estrogens.
61Clusters, if they exist, are consistent
- Different methods look at different structures in
data. However, if the separation is clear, the
resulting clusters should be similar.
62Illumina
Low sample input requirements: just 50-100 ng of total RNA required
Low per-sample cost: less than half the price of other commercial arrays