1
Gene expression studies of cancer: gene
transcription signatures
  • Chad Creighton
  • February 2009

2
Oncogenic signaling pathways in cancer
Mutation/deregulation of a handful of genes can
make cells into cancer cells.
Hanahan and Weinberg. Cell. 2000; 100:57-70.
3
(No Transcript)
4
Widespread deregulation of gene expression in
cancer
  • Gene expression profiling distinguishes prostate
    cancer from normal prostate and from BPH.

Dhanasekaran et al. Nature. 2001 Aug 23;
412(6849):822-6.
5
Widespread deregulation of gene expression in
cancer
  • Gene expression profiling identifies different
    subtypes of breast cancer.

Sorlie et al. PNAS. 2003; 100(14):8418-23.
6
A gene-expression signature as a predictor of
survival in breast cancer
www.agendia.com
Van de Vijver et al. NEJM. 2002; 347(25):1999-2009.
7
A 21-gene assay to predict recurrence of breast
cancer
Paik et al. NEJM. 2004; 351(27):2817-26.
8
Oncogenic pathway signatures in human cancers as
a guide to targeted therapies
  • Use oncogenic signatures to predict response of
    cell lines to targeted therapy.

Bild et al. Nature. 2006; 439(7074):353-7.
9
Oncogenic signatures of ERBB2, EGFR, MEK, RAF,
and MAPK in breast cancer cells
Creighton et al. Cancer Res. 2006; 66(7):3903-11.
10
Preliminary gene expression profiling studies of
cancer
  • Hundreds of genes are deregulated in cancer.
  • Different subtypes of cancer are defined by gene
    expression profiling.
  • Gene expression signatures may predict cancer
    patient survival.
  • Gene expression signatures of oncogenic signaling
    pathways can be defined using experimental models
    (cell lines, mice).

11
Potential uses for gene expression profiling of
cancer
  • Define and understand the molecular pathways that
    underlie cancer.
  • Define subgroups of patients for the purposes of
    optimizing treatment.
  • Determine whether or not a patient would benefit
    from a given therapy (e.g. chemotherapy).
  • Determine what specific pathways are deregulated
    in the tumor and treat the tumor with therapies
    that target that pathway (e.g. hormone therapy
    for ER+ breast cancer).

12
General concepts of gene expression analysis
  • Low level analysis
  • Processing image files
  • Normalization
  • Quality Control (QC)
  • High level analysis
  • Clustering
  • Selecting differentially expressed genes
  • Enrichment analysis or Meta-analysis

13
Publicly available gene expression profile data
represents a rich resource
  • When publishing studies using gene expression
    profile data, authors are encouraged to make the
    data available to everyone.
  • Subsequent studies can re-analyze the data with
    different questions in mind from what the
    original authors had.

14
  • The GEO database (http://www.ncbi.nlm.nih.gov/geo/)
    makes thousands of expression profile datasets
    publicly available.
  • Many top journals require microarray studies to
    make their data public on GEO.

15
Pathway-related gene sets: Gene Ontology (GO)
terms
  • The Gene Ontology project provides a controlled
    vocabulary to describe gene attributes.
  • Three major categories
  • Cellular component
  • Biological process
  • Molecular function
  • The controlled vocabularies are structured so
    that they can be queried at different levels
  • For example, use GO to find all gene products
    involved in signal transduction, or zoom in on
    all receptor tyrosine kinases.

www.geneontology.org
16
Pathway-related gene sets: Molecular Signatures
Database (MSigDB)
  • From the Broad Institute
  • Collection of gene sets curated from the
    literature (including gene expression profiling
    studies).
  • Current version represents over 1800
    pathway-associated gene sets

http://www.broad.mit.edu/gsea/msigdb/index.jsp
17
Gene signatures
  • Will be loosely defined here to mean a set of
    genes that are functionally associated with each
    other in some way.
  • Ways to define gene signatures
  • Gene annotation (e.g. Gene Ontology terms)
  • Curated pathway-associated gene sets
  • Literature review articles
  • Gene expression signature: a gene signature
    defined using expression profiling data
  • (e.g. what genes go up or down in response to
    treatment in an experimental model)

18
Gene expression signatures
  • When using expression profiling to define genes,
    a gene expression signature consists of two
    things
  • A set of genes going up (relative to
    something).
  • A set of genes going down (relative to
    something).
  • Relative direction of the genes (up-regulated vs
    down-regulated, or over-expressed vs
    under-expressed) is important.
  • Keep the up genes separated from the down
    genes.

19
How do we relate gene expression profile results
from different datasets to each other?
20
Methods for determining enrichment of gene
signatures within the overall patterns of another
expression profile dataset
OR How do we relate gene expression profile
results from different datasets to each other?
21
The enrichment problem
  • (A) Given a gene set or sets of interest.
  • i.e. a gene signature
  • (B) Given an independent expression dataset with
    the profiled genes being ranked by a specified
    metric.
  • e.g. cancer vs. normal or correlation with
    MYC.
  • Are the genes in (A) enriched within (B)?
  • i.e. do the results of (A) and (B) overlap
    significantly?

22
Methods for determining enrichment
  • Venn diagram, or marble jar approach
  • Take the top set of genes from the expression
    dataset (dataset B), tabulate the amount of
    overlap with the independent gene set of interest
    (dataset A).
  • Rank-based approach
  • Use the entire dataset, including genes of
    borderline significance or showing a weak trend
    towards significance.
  • Correlation approach
  • For a set of genes, compute correlation between
    two sets of weighting factors (based on different
    profiling datasets).

23
Venn diagram enrichment analysis
  • Requires us to make a cut to define what the
    top genes are.
  • Significance of overlap may be determined by
    chi-square or one-sided Fisher's exact tests.

24
Venn diagram enrichment analysis
Define gene set of interest

25
Venn diagram enrichment analysis
Define differentially expressed genes

26
Venn diagram enrichment analysis
Determine overlap between the two gene sets

27
Hypergeometric formula (one-sided Fisher's exact
test)
  • Number of genes in total population: G
  • Genes in G falling under pre-defined class: A
  • Number of genes selected: k
  • Number of the k selected genes in class A: n
  • The number of genes expected to overlap by
    chance: (k × A)/G
  • One-sided Fisher's exact test determines whether
    n is significantly greater than (k × A)/G

28
Hypergeometric formula (one-sided Fisher's exact
test)
  • Number of genes in total population: G
  • Genes in G falling under pre-defined class: A
  • Number of genes selected: k
  • Number of the k selected genes in class A: n
  • The probability P for the term occurring n or
    more times within a set of k genes randomly
    selected from the population:
    P = \sum_{i=n}^{\min(k,A)} \binom{A}{i} \binom{G-A}{k-i} / \binom{G}{k}

29
What is the total gene population (G)?
  • Can represent the number of genes profiled on the
    array chip.
  • What if two different array platforms were used
    (a different set of genes are typically
    represented in each)?
  • Use the common set of genes represented on both
    array chips as the total population (do not
    consider genes not represented on both arrays)
  • Use ONE of the two array platforms to define the
    gene population (do not consider genes on the
    other array platform that are not represented on
    the first platform)

30
A gene signature of mutation of EGFR in NSCLC
cell lines
  • Compared lung cancer cell lines with or without
    an activating mutation in EGFR.
  • Wanted to compare this gene signature with
    another gene signature of EGFR

Lung cancer cell lines
Choi, Creighton, et al., PLoS ONE 2(11):e1226.
31
Oncogenic signatures of ERBB2, EGFR, MEK, RAF,
and MAPK in breast cancer cells
  • Does the published MCF-7 EGFR signature overlap
    with the NSCLC EGFR signature?

Creighton et al. Cancer Res. 2006; 66(7):3903-11.
32
Compare NSCLC EGFR mutant signature with a
signature of EGFR-transfected MCF-7 cells
  • EGFR wt NSCLC genes: 119
  • MCF7 EGFR genes: 1152
  • Genes shared between MCF7/NSCLC array platforms:
    11079
  • Genes shared between MCF7/NSCLC gene signatures:
    44

Significance of overlap: p < 1E-10
One-sided Fisher's exact test
Choi, Creighton, et al., PLoS ONE 2(11):e1226.
33
A gene signature of mutation of EGFR in NSCLC
cell lines is enriched with EGFR-dependent genes.
Choi, Creighton, et al., PLoS ONE 2(11):e1226.
34
Experimental models versus clinical tumors
  • Molecular data from experimental models represent
    dynamic information, but clinical relevance is
    not always clear (e.g. could represent
    experimental artifacts).
  • Data from clinical tumor specimens represent more
    static information, where the associations
    observed may be pathologically relevant.

35
Experimental models versus clinical tumors
  • From clinical data, cannot distinguish
    cause-and-effect associations from correlation
    alone.
  • In cancer studies, important to combine the
    experimental with the clinical.
  • Some researchers may doubt the validity of
    experimental results unless they can be shown to
    apply to human tissues

36
Rank-based enrichment analysis
Locations of genes from set B
Rank ordered genes from dataset A
  • Rank-based approaches use all of the genes from
    one of the datasets to determine enrichment (does
    not make a cut).

37
GSEA (rank-based) enrichment analysis
38
GSEA (rank-based) enrichment analysis
All the genes in the dataset are used here
Subramanian, Aravind et al. (2005) Proc. Natl.
Acad. Sci. USA 102, 15545-15550
  • Start from the top of the Ranked list.
  • Add points to Random walk for each gene you
    find in S.
  • Remove points from Random walk for each gene
    not in S.

39
GSEA Kolmogorov-Smirnov statistic
Consider the genes R_1, ..., R_N that are ordered on
the basis of the difference metric between the
two classes and a gene set S containing G
members. We define
    X_i = -\sqrt{G / (N - G)}  if R_i is not a member of S, or
    X_i = \sqrt{(N - G) / G}   if R_i is a member of S.
We then compute a running sum across all N genes. The ES
is defined as
    ES = \max_{1 \le j \le N} \sum_{i=1}^{j} X_i,
or the maximum observed
positive deviation of the running sum.
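The running sum described on this slide can be sketched as follows. This assumes the unweighted Kolmogorov-Smirnov increments, +sqrt((N-G)/G) for a gene in S and -sqrt(G/(N-G)) otherwise; the GSEA software uses a weighted variant of this statistic, so treat this as an illustration of the idea rather than a reimplementation. The function name and gene labels are hypothetical.

```python
from math import sqrt


def enrichment_score(ranked_genes, gene_set):
    """Maximum positive deviation of the running sum over a ranked gene list.

    ranked_genes: genes ordered by the difference metric (best first).
    gene_set: the set S; only members present in ranked_genes are counted.
    """
    members = set(gene_set) & set(ranked_genes)
    N, G = len(ranked_genes), len(members)
    if G == 0 or G == N:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit = sqrt((N - G) / G)    # step up for a gene in S
    miss = -sqrt(G / (N - G))  # step down for a gene not in S
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit if gene in members else miss
        best = max(best, running)
    return best


ranked = list("abcdefghij")                  # hypothetical ranked genes
print(enrichment_score(ranked, {"a", "b"}))  # set at the top of the list: 4.0
print(enrichment_score(ranked, {"i", "j"}))  # set at the bottom: 0.0
```

The increments are chosen so that the running sum returns to zero after the last gene, which is why a set concentrated at the bottom of the list scores zero rather than a negative value.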
40
GSEA Kolmogorov-Smirnov statistic
  • The ES score (the peak of the Random walk) is
    just a number.
  • Need to evaluate the significance of the number
    by some type of permutation testing
  • Permute the sample labels many times, OR
  • Permute the gene sets (i.e. randomly generate
    gene sets).
  • In either case, compare distribution of scores
    from random tests with the actual score.

41
GSEA (rank-based) enrichment analysis
Subramanian, Aravind et al. (2005) Proc. Natl.
Acad. Sci. USA 102, 15545-15550
Examples of GSEA running enrichment scores
42
GSEA (rank-based) enrichment analysis
Subramanian, Aravind et al. (2005) Proc. Natl.
Acad. Sci. USA 102, 15545-15550
Sets with genes not located at the top of the
ranked gene population may still yield
significant enrichment scores.
43
A mechanism of cyclin D1 action encoded in the
patterns of gene expression in human cancer
Lamb, et al. Cell 114:323-34, 2003
44
The Connectivity Map of gene signatures induced
by 164 different small molecule inhibitors
Lamb et al., Science. 2006; 313(5795):1929-35
45
The Connectivity Map
(Scoring derived from GSEA statistic)
46
Q1-Q2 analysis (another rank-based approach)
  • Q1: Compare enrichment pattern to that for
    randomly selected gene sets
  • Q2: Compare enrichment pattern to that for
    randomly permuted labels in the reference profile
    dataset

Tian, et al. PNAS 102:13544-13549, 2005
47
A gene expression signature of Akt overexpression
from a transgenic mouse model
Majumder et al. Nat Med 10:594-601, 2004
48
Creighton CJ, Oncogene. 2007; 26:4648-55
49
Venn diagram vs Rank-based methods
  • Venn diagram results more easily interpretable.
  • For rank-based methods, genes that are not at all
    significant individually may contribute to
    enrichment.
  • What gene do you go after for validation?
  • With a Venn diagram, you have to make a cut.
  • May not include enough genes in the test.

50
Venn diagram vs Rank-based methods
51
Venn diagram vs Rank-based methods, what is a
significant p-value?
  • If using the Venn diagram method in expression
    studies, p-value should be very low if working
    with sizable gene sets (e.g. p < 1E-6).
  • If using rank-based method, can consider a
    nominally significant p-value (e.g. p < 0.05) to be
    good if permuting the sample labels is involved.
  • Can always try both ways in order to be certain
    of an enrichment association.

52
Rank-based Q1-Q2 versus GSEA
  • Q1-Q2 enrichment score is much simpler
  • Take the sum of the t-statistic values for each
    gene in the set.
  • GSEA scoring is more complicated.
  • GSEA has user-friendly public software
    (http://www.broad.mit.edu/gsea/)
  • No software yet for Q1-Q2, have to write your own.
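Since the slide notes that Q1-Q2 has to be hand-written, here is a minimal sketch of the Q1 half only, following the slide's description: the score is the sum of per-gene t-statistics, compared against scores for randomly selected gene sets of the same size. The function name, gene labels, and t-statistics are hypothetical, and the published method may normalize the score differently. Q2 would instead permute the sample labels and recompute the t-statistics, which is not shown.

```python
import random


def q1_enrichment(t_stats, gene_set, n_perm=1000, seed=0):
    """Return (observed score, one-sided p-value) for positive enrichment.

    t_stats: dict mapping gene -> t-statistic from the reference dataset.
    gene_set: signature genes; genes missing from t_stats are ignored.
    """
    rng = random.Random(seed)
    population = list(t_stats)
    members = [g for g in gene_set if g in t_stats]
    observed = sum(t_stats[g] for g in members)
    at_least_as_high = 0
    for _ in range(n_perm):
        random_set = rng.sample(population, len(members))
        if sum(t_stats[g] for g in random_set) >= observed:
            at_least_as_high += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_high + 1) / (n_perm + 1)


# Hypothetical data: 100 genes with t-statistics from -50 to 49.
t_stats = {"gene%d" % i: float(i - 50) for i in range(100)}
up_genes = ["gene99", "gene98", "gene97", "gene96", "gene95"]
score, p = q1_enrichment(t_stats, up_genes)
print(score, p)
```

Run the up genes and down genes of a signature separately (testing the down genes for low scores), in keeping with the advice to keep the two directions apart.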

53
Correlation-based approach
  • Take the correlation between two sets of
    profiling results from different datasets.
  • May use all of the genes profiled or a specified
    subset (e.g. genes in a gene signature).
  • The correlation metric may be any one of a number
    of valid metrics (e.g. Pearson's or Spearman's
    rank).

54
Correlation-based approach
  • Each gene used in the correlation may be
    weighted in a number of ways
  • t-statistic, comparing two groups
  • Mean-centered expression values
  • 1 or -1 for up or down, respectively
  • Again, direction of the genes is important
  • Positive correlation indicates similar overall
    patterns between the two datasets.
  • Example: IGF activation score from Creighton et
    al., JCO 2008.
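A correlation-based score of the kind described above can be sketched as follows, using the 1 / -1 weighting for up and down genes and Pearson's correlation against a tumor's mean-centered expression values. This is an illustration under those assumptions, not the exact scoring used in any cited study; the function names, gene labels, and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def signature_score(signature, tumor_profile):
    """Correlate signature weights (+1 up, -1 down) with a tumor's
    mean-centered expression values, over the genes present in both."""
    shared = [g for g in signature if g in tumor_profile]
    weights = [signature[g] for g in shared]
    values = [tumor_profile[g] for g in shared]
    return pearson(weights, values)


signature = {"A": 1, "B": 1, "C": -1, "D": -1}              # hypothetical signature
tumor_like = {"A": 2.0, "B": 1.0, "C": -1.0, "D": -2.0}     # same direction
tumor_unlike = {"A": -2.0, "B": -1.0, "C": 1.0, "D": 2.0}   # opposite direction
print(signature_score(signature, tumor_like))    # positive
print(signature_score(signature, tumor_unlike))  # negative
```

Because the weights carry a sign, a tumor showing the reverse pattern gets a negative score instead of being grouped with tumors that show the signature.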

55
Example analyses comparing gene transcription
signatures from different studies
56
Gene expression signatures of oncogenic pathways
from published studies
  • Includes
  • MYC
  • c-Src
  • beta-catenin
  • Cell cycle
  • cyclin D1
  • E2F3
  • androgen
  • estrogen
  • Ras
  • Akt
  • erbB-2
  • MEK
  • EGFR
  • Raf
  • MAPK

57
Multiple public gene expression profile datasets
of prostate tumors
58
Experimentally-derived oncogenic signatures in
human prostate cancer
  • Are there patterns of interest shared between the
    oncogenic signatures and the prostate tumors?
  • Examine genes that are high/low with oncogene
    expression in the human tumors.
  • Is the corresponding oncogenic signature enriched
    in those tumors?
  • Use both Q1-Q2 (rank-based) and one-sided
    Fisher's exact (Venn diagram) methods

59
A mechanism of cyclin D1 action encoded in the
patterns of gene expression in human cancer
Lamb, et al. Cell 114:323-34, 2003
Use same idea from Lamb et al., only look at
multiple signatures in multiple prostate tumor
datasets
60
Genes up-regulated by a specific oncogene in
experimental models are co-expressed as a group
with the oncogene in clinical prostate tumors
Enrichment results using Q1-Q2 rank-based method
61
Number of genes overlapping between oncogenic signatures and clinical tumor correlates.

Oncogenic signature    Signature genes   Prostate tumor genes   Expected overlap   Actual overlap   P-value
androgen_up_Chen       559               176                    7                  29               1.2E-11
Myc_up_Bild            993               150                    10                 33               5.5E-10
Src_up_Bild            1566              118                    12                 23               0.002
erbB-2_up_Creighton    1315              307                    27                 91               1.1E-26
EGFR_up_Creighton      734               28                     1                  1                0.75
cyclin_D1_up_Lamb      206               139                    2                  8                0.0006
Akt_up_Majumder        770               280                    14                 47               3.8E-13

Prostate tumor genes: number of genes positively correlated with the corresponding oncogene/biomarker in human prostate tumors (criterion: p < 0.05 in at least three out of four profile datasets).
P-value: by one-sided Fisher's exact test.
62
A gene signature of Insulin-like growth factor I
(IGF-I)
  • Substantial evidence implicates insulin-like
    growth factor I (IGF-I) signaling in the
    development and progression of breast cancer.
  • Gene expression profiling of IGF-I-stimulated
    MCF-7 cells was performed.
  • An IGF-I gene signature was examined in human
    breast tumors, as well as in experimental models
    for specific oncogenic signaling pathways.

Creighton CJ, et al., Lee AV. JCO. 2008; 26:4078-85.
63
Genes altered by IGF-I at 3hr or 24hr or both
64
A gene signature of Insulin-like growth factor I
(IGF-I)
65
(No Transcript)
66
Oncogenic pathway signatures in human cancers as
a guide to targeted therapies
  • Examine previously published dataset for
    oncogenic signatures overlapping with IGF
    signature

Bild et al. Nature. 2006; 439(7074):353-7.
67
The IGF signature is enriched for transcriptional
targets of the Ras pathway
68
The Connectivity Map of gene signatures induced
by 164 different small molecule inhibitors
Lamb et al., Science. 2006; 313(5795):1929-35
69
The IGF signature is enriched for transcriptional
targets of the PI3K/Akt/mTOR pathway
70
IGF signature is present in human breast cancers
71
Widespread deregulation of gene expression in
cancer
  • Gene expression profiling identifies different
    subtypes of breast cancer.

Sorlie et al. PNAS. 2003; 100(14):8418-23.
72
IGF signature is present in luminal B and basal
breast tumors
Data from Sorlie et al. PNAS. 2003;
100(14):8418-23.
73
IGF signature is associated with poor prognosis
in ER+ breast tumors
74
Relating gene expression profile results from
different datasets to each other by unsupervised
clustering methods: USUALLY NOT A GOOD IDEA
  • Unsupervised clustering is a technique for data
    analysis that partitions a data set into
    subsets whose elements share common traits.
  • Many groups will try to relate a gene signature
    to another dataset by clustering the samples in
    the dataset using the genes in the signature.
  • The main problem with this: unsupervised
    clustering does not take the direction of the
    genes in the signature into account.

75
  • Identification of a Common Serum Response (CSR)
    gene signature in fibroblasts
  • Starve fibroblasts, then give them serum and see
    what genes are up-regulated or down-regulated.

Chang et al., PLoS Biol. 2004 Feb; 2(2):E7
76
Survey of fibroblast CSR gene expression in human
cancers
  • Using the genes in the CSR signature, cluster
    human tumors.
  • Tumors form two major groups.

Chang et al., PLoS Biol. 2004 Feb; 2(2):E7
77
Prognostic value of fibroblast CSR in epithelial
tumors
  • Tumors in the activated group had worse outcome.

Chang et al., PLoS Biol. 2004
78
What issues are there with this type of analysis
approach?
  • The clustering method does not tell us which
    direction the CSR genes are moving.
  • Are genes up in the CSR signature also up in the
    Activated tumor set?

79
What issues are there with this type of analysis
approach?
  • These bars indicate the direction of the CSR
    genes in these clusters (red = up)
  • CSR pattern does appear here to be manifested in
    half the tumors.

80
Excel functions/features you will need for the
computational exercise
81
TTEST Worksheet function
TTEST(array1,array2,tails,type)
  • Array1    is the first data set.
  • Array2    is the second data set.
  • Tails    specifies the number of distribution
    tails (Use 2 for the computational exercise.)
  • Type    is the kind of t-Test to perform (Use
    2).

82
AVERAGE Worksheet function
AVERAGE(number1, number2, ...)
  • Number1, number2, ...    are 1 to 30 numeric
    arguments for which you want the average.
  • The arguments must either be numbers or be names,
    arrays, or references that contain numbers.

83
Data -> Filter -> AutoFilter
  • When you use the AutoFilter command, AutoFilter
    arrows appear to the right of the column labels
    in the filtered range.
  • Microsoft Excel indicates the filtered items with
    blue.
  • You use custom AutoFilter to display rows that
    meet complex criteria; for example, you might
    display rows that contain values within a
    specific range (e.g. p < 0.01)
  1. Unfiltered range
  2. Filtered range

84
MATCH Worksheet function
MATCH(lookup_value,lookup_array,match_type)
  • Lookup_value   is the value you use to find the
    value you want in a table.
  • Lookup_value is the value you want to match in
    lookup_array. For example, when you look up
    someone's number in a telephone book, you are
    using the person's name as the lookup value, but
    the telephone number is the value you want.
  • Lookup_value can be a value (number, text, or
    logical value) or a cell reference to a number,
    text, or logical value.
  • Lookup_array   is a contiguous range of cells
    containing possible lookup values. Lookup_array
    must be an array or an array reference.
  • Match_type   should be set to 0 for our purposes.

85
COUNT Worksheet function
  • If an argument is an array or reference, only
    numbers in that array or reference are counted.
    Empty cells, logical values, text, or error
    values in the array or reference are ignored.

94
(Dont forget the )
96
R functions you will need for the computational
exercise
97
dhyper function in R
  • Example
  • 100 balls
  • 10 of the balls are red
  • I grab 20 balls
  • Five of my 20 balls are red
  • Was the number of red balls I selected a
    significant number ?

> m <- 10    # number of red balls
> n <- 90    # number of other balls (total pop - m)
> k <- 20    # number of balls selected
> x <- 0:k   # vector of successes
> 1 - sum(dhyper(x, m, n, k)[1:5])
[1] 0.02546455
98
Compare NSCLC EGFR mutant signature with a
signature of EGFR-transfected MCF-7 cells
  • EGFR wt NSCLC genes: 119
  • MCF7 EGFR genes: 1152
  • Genes shared between MCF7/NSCLC array platforms:
    11079
  • Genes shared between MCF7/NSCLC gene signatures:
    44

Significance of overlap: p < 1E-10
One-sided Fisher's exact test
Choi, Creighton, et al., PLoS ONE 2(11):e1226.
99
dhyper function in R
  • EGFR mutant signature example
  • 11079 Genes shared between MCF7/NSCLC array
    platforms
  • 119 EGFR wt NSCLC genes
  • 1162 MCF7 EGFR genes
  • 44 genes shared between MCF7/NSCLC gene signatures

> m <- 119           # number of EGFR wt NSCLC genes
> n <- 11079 - 119   # number of other genes
> k <- 1162          # number of MCF7 EGFR genes
> x <- 0:k           # vector of successes
> 1 - sum(dhyper(x, m, n, k)[1:44])
[1] 1.265654e-14
100
General concepts of gene expression analysis
101
General concepts of gene expression analysis
  • Low level analysis
  • Processing image files.
  • Normalization
  • QC
  • High level analysis
  • Clustering
  • Selecting differentially expressed genes.
  • Enrichment analysis

102
Processing image files
  • From CEL, GPR, or TXT files with image
    information, want to generate gene expression
    values
  • For two color arrays (e.g. Stanford cDNA arrays),
    can use Bioconductor
  • For one channel array (e.g. Affymetrix), can use
    dChip or Bioconductor

103
Normalization
  • Purpose: To adjust the overall chip brightness of
    the arrays to a similar level
  • Methods
  • Two channel arrays
  • Loess normalization is good
  • One channel arrays
  • Total intensity normalization
  • Quantile normalization
  • Invariant set normalization
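As an illustration of one of the methods listed, quantile normalization forces every array to share the same distribution of values: sort each array, average the values at each rank across arrays, then hand each gene the average for its rank. The sketch below is a minimal pure-Python version with hypothetical intensities; ties are broken by position, which real implementations usually handle more carefully.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per chip)."""
    n_genes = len(arrays[0])
    n_arrays = len(arrays)
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the values holding each rank.
    rank_means = [
        sum(s[rank] for s in sorted_arrays) / n_arrays for rank in range(n_genes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_genes), key=lambda i: a[i])  # gene indices, low to high
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]  # hypothetical intensities, 3 genes x 2 chips
print(quantile_normalize(chips))            # [[7.0, 1.5, 3.5], [3.5, 1.5, 7.0]]
```

After normalization every chip contains the same set of values, so overall brightness differences between chips are removed while each gene keeps its rank within its own chip.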

104
Before Normalization
After Normalization
www.dchip.org
105
High level analysis
  • Selecting differentially expressed genes
  • Account for multiple testing
  • Clustering
  • Hierarchical clustering
  • Principal Components analysis
  • K-means clustering
  • Enrichment analysis or Meta-analysis

106
Selecting differentially expressed genes
  • Student's t-test or ANOVA typically used
  • Works best on log-transformed data
  • Other criteria
  • fold change
  • Higher average signal intensity might indicate
    greater abundance
  • What p-value cutoff do you choose?
  • No right answer
  • Need to balance between false positives and false
    negatives
  • More stringent p-value, fewer false positives,
    more false negatives
  • Less stringent p-value, fewer false negatives,
    more false positives
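For a single gene, the two criteria above (a t-test on log-transformed data, plus fold change) can be sketched as follows. This computes only the equal-variance two-sample t-statistic, the same test type selected by `Type` 2 in the Excel TTEST function; turning it into a p-value needs the t distribution, for which a statistics package would normally be used. The function names and expression values are hypothetical.

```python
from math import log2, sqrt
from statistics import mean, variance


def pooled_t_statistic(group1, group2):
    """Two-sample t-statistic assuming equal variances (Student's t-test)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


def log2_fold_change(group1, group2):
    """Difference of mean log2 values, i.e. log2 of the geometric-mean ratio."""
    return mean(group1) - mean(group2)


# Hypothetical raw intensities for one gene in tumor and normal samples.
tumor = [log2(v) for v in (8, 16, 32)]
normal = [log2(v) for v in (2, 4, 8)]
print(pooled_t_statistic(tumor, normal))  # about 2.449
print(log2_fold_change(tumor, normal))    # 2.0, i.e. a 4-fold difference
```

Reporting both numbers is useful because a gene can have a large t-statistic with a small fold change, or the reverse, when only a few samples are available.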

107
Multiple testing
  • When evaluating thousands of genes, some will
    show a nominally significant P-value by chance
    alone
  • Somewhat like buying lots and lots of lottery
    tickets your chances of winning greatly improve.
  • Want to estimate false discovery rate (FDR)

108
Multiple testing
  • Estimate FDR by method from Storey et al. (PNAS
    2003; 100:9440-5).
  • Use permutation testing (e.g. SAM analysis,
    Tusher et al., PNAS 2001; 98:5116-21)
  • Randomly assign sample labels and do the test
  • Do it many times to get a distribution of false
    positives

FDR = (Number of genes on the array × nominal
P-value) / (Number of genes significant with that
P-value)
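As a worked example of this ratio: the numerator is the number of genes expected to pass the cutoff by chance alone, and the denominator is the number that actually passed. The array size and counts below are hypothetical, and the function name is illustrative.

```python
def estimated_fdr(n_genes_on_array, nominal_p, n_significant):
    """Estimated false discovery rate: expected chance hits / observed hits."""
    if n_significant <= 0:
        raise ValueError("no significant genes, so the FDR is undefined")
    expected_by_chance = n_genes_on_array * nominal_p
    return min(1.0, expected_by_chance / n_significant)


# Hypothetical: 10,000 genes on the array, 500 of them with p < 0.01.
# About 100 are expected by chance, so roughly 20% may be false positives.
print(estimated_fdr(10000, 0.01, 500))
```

If the observed count is no larger than the chance expectation, the estimate is capped at 1, meaning the selected genes cannot be distinguished from noise at that cutoff.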
109
Cluster analysis
  • Cluster analysis relates to grouping or
    segmenting a collection of objects (e.g. genes or
    samples) into subsets or "clusters", such that
    those within each cluster are more closely
    related to one another than objects assigned to
    different clusters.
  • Central to cluster analysis is the notion of
    degree of similarity (or dissimilarity) between
    the individual objects being clustered.

110
Cluster analysis
  • Major methods of clustering include hierarchical
    clustering, k-means clustering, and principal
    components analysis (PCA)
  • Hierarchical clustering is most common for
    expression profile data analysis
  • Cluster and JavaTreeview, public software
    programs from Eisen et al. (http://rana.lbl.gov/),
    are handy for cluster analysis and/or generating
    heat maps

111
Hierarchical clustering: 3 methods for measuring
distance between clusters
  • Single linkage, using the members of each cluster
    that are closest to each other

http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
112
Hierarchical clustering: 3 methods for measuring
distance between clusters
  • Complete linkage, using the members of each
    cluster that are furthest from each other

http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
113
Hierarchical clustering: 3 methods for measuring
distance between clusters
  • Average linkage, using the average of each
    cluster, most commonly used.

http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
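The three linkage rules differ only in how the pairwise distances between two clusters are summarized. The sketch below assumes Euclidean distance and takes average linkage to mean the average of all pairwise distances (the usual UPGMA definition); some tools instead compare cluster averages directly, so this is one reading of the slide. The function names and points are hypothetical.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def pairwise_distances(cluster1, cluster2):
    """All distances between a member of cluster1 and a member of cluster2."""
    return [dist(a, b) for a, b in product(cluster1, cluster2)]


def single_linkage(cluster1, cluster2):
    """Distance between the closest pair of members."""
    return min(pairwise_distances(cluster1, cluster2))


def complete_linkage(cluster1, cluster2):
    """Distance between the furthest pair of members."""
    return max(pairwise_distances(cluster1, cluster2))


def average_linkage(cluster1, cluster2):
    """Mean of all pairwise distances between the two clusters."""
    distances = pairwise_distances(cluster1, cluster2)
    return sum(distances) / len(distances)


left = [(0.0, 0.0), (1.0, 0.0)]   # hypothetical expression profiles
right = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(left, right))    # 2.0
print(complete_linkage(left, right))  # 5.0
print(average_linkage(left, right))   # 3.5
```

Hierarchical clustering repeatedly merges the two clusters with the smallest linkage distance, so the choice of rule changes which clusters merge first and therefore the shape of the tree.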
114
Widespread deregulation of gene expression in
cancer
  • Gene expression profiling identifies different
    subtypes of breast cancer.

Sorlie et al. PNAS. 2003; 100(14):8418-23.
115
Final words on gene expression profile analysis
  • All good roads lead to Rome.
  • i.e., there are many ways to go about exploratory
    analysis, which can lead to the same overall
    conclusions
  • What's important
  • Be clear and concise about what you did (so
    others can understand it and repeat it)
  • Don't try to fool anybody (including yourself)