Title: Analysis of Gene Expression Data
1Analysis of Gene Expression Data
- Yoonsoo Pyon
- ysp2_at_case.edu
- Feb. 8th, 2008
2MicroArray
- What are they?
- Allow thousands of expression analyses to be
performed concurrently.
3DNA Chip Microarrays
- Put a large number (100K) of cDNA sequences or
synthetic DNA oligomers onto a glass slide (or
other substrate) in known locations on a grid.
- Label an RNA sample and hybridize
- Measure amounts of RNA bound to each square in
the grid
- Make comparisons
- Cancerous vs. normal tissue
- Treated vs. untreated
- Time course
- Many applications in both basic and clinical
research
4Goals of a Microarray Experiment
- Find the genes that change expression between
experimental and control samples
- Classify samples based on a gene expression
profile
- Find patterns: groups of biologically related
genes that change expression together across
samples/treatments
5Potential Microarray Applications
- Drug discovery / toxicology studies
- Mutation/polymorphism detection
- Differing expression of genes over
- Time
- Tissues
- Disease States
- Sub-typing complex genetic diseases
6cDNA Microarray Technologies
- Spot cloned cDNAs onto a glass microscope slide
- usually PCR amplified segments of plasmids
- Label 2 RNA samples with 2 different colors of
fluorescent dye: control vs. experimental
- Mix two labeled RNAs and hybridize to the chip
- Make two scans - one for each color
- Combine the images to calculate ratios of amounts
of each RNA that bind to each spot
7Spot your own Chip
Robot spotter
Ordinary glass microscope slide
8cDNA Spotted Microarrays
9Data Acquisition
- Scan the arrays
- Quantitate each spot
- Subtract background
- Normalize
- Export a table of fluorescent intensities for
each gene in the array
10MicroArray
- Overview of image analysis
- Grid finding
- grid alignment
- skew
- Quantification of image
- variable background
- uneven hybridization
11Image Analysis/Data Quantization
- Feature (target ? probe) segmentation
- Data extraction and quantization of
- Background
- Feature
- Correlation of feature identity and location
within image
- Display of pseudo-color image
12Image Segmentation
13Normalization
- Can control for many of the experimental sources
of variability (systematic, not random or gene
specific)
- Bring each image to the same average brightness
- Can use simple math or something fancier:
- divide by the mean (whole chip or by sectors)
- LOESS (locally weighted regression)
- No sure biological standards
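The "divide by the mean" option can be sketched in a few lines. This is a minimal illustration, not a full normalization pipeline: the function name `normalize_to_mean` and the two example chips are hypothetical, and LOESS is not shown because it needs a regression library.

```python
from statistics import mean

def normalize_to_mean(intensities, target=1.0):
    """Scale one chip's intensities so that their mean equals `target`."""
    chip_mean = mean(intensities)
    if chip_mean <= 0:
        raise ValueError("mean intensity must be positive")
    return [target * value / chip_mean for value in intensities]

# Hypothetical data: the same sample on two chips, chip_b scanned twice as bright.
chip_a = [100.0, 200.0, 300.0, 400.0]
chip_b = [200.0, 400.0, 600.0, 800.0]

norm_a = normalize_to_mean(chip_a)
norm_b = normalize_to_mean(chip_b)

# After scaling, both chips have the same average brightness,
# so the systematic 2x difference between them disappears.
print(norm_a)
print(norm_b)
```

Normalizing by sectors would apply the same function to each sector's spots separately instead of to the whole chip.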
14Are the Treatments Different?
- Analysis of microarray data has tended to focus
on making lists of genes that are up or down
regulated between treatments
- Before making these lists, ask the
question "Are the treatments different?"
- Use standard statistical methods to evaluate
expression profiles for each treatment (t-test or
F-test)
- If there are differences, find the genes most
responsible
- If there are not significant overall differences,
then lists of genes with large fold changes may
only reflect random variability.
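As one possible reading of the t-test bullet, the sketch below computes Welch's unequal-variance t statistic for a single gene measured in replicate under two treatments. The choice of Welch's variant, the function name `welch_t`, and the replicate values are assumptions for illustration. Turning the statistic into a p-value needs the t distribution, which the Python standard library does not provide, so that step is left out.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each group mean
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical log-scale expression values for one gene, four replicates each.
control = [7.9, 8.1, 8.0, 8.2]
treated = [9.0, 9.3, 8.9, 9.2]

t, df = welch_t(treated, control)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large |t| relative to the replicate noise suggests a real difference; a large fold change with a small |t| may only reflect random variability.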
15Microarray Experiment Design
- Type I (n = 2)
- How is this gene expressed in target 1 as
compared to target 2?
- Which genes show up/down regulation between the
two targets?
- Type II (n > 2)
- How does the expression of gene A vary over time,
tissues, or treatments?
- Do any of the expression profiles exhibit similar
patterns of expression?
16Basic Data Analysis
- Fold change (relative increase or decrease in
intensity for each gene)
- Set cutoff filter for low values (background
noise)
- Cluster genes by similar changes: only really
meaningful across multiple treatments or time
points
- Cluster samples by similar gene expression
profiles
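The first two bullets (fold change plus a low-value cutoff) can be sketched as follows. The gene names, intensities, and the cutoff of 50 are hypothetical; a suitable cutoff depends on the scanner and the background level.

```python
from math import log2

def log2_fold_changes(control, treated, min_intensity=50.0):
    """Return {gene: log2(treated / control)} for genes above the noise floor."""
    result = {}
    for gene, c in control.items():
        t = treated[gene]
        # Skip non-positive values and genes where both signals sit in the
        # background noise: their ratio is not meaningful.
        if min(c, t) <= 0 or max(c, t) < min_intensity:
            continue
        result[gene] = log2(t / c)
    return result

# Hypothetical background-subtracted intensities.
control = {"geneA": 100.0, "geneB": 800.0, "geneC": 10.0, "geneD": 500.0}
treated = {"geneA": 400.0, "geneB": 200.0, "geneC": 40.0, "geneD": 500.0}

changes = log2_fold_changes(control, treated)
for gene, value in changes.items():
    print(gene, value)
```

Here geneC shows the same 4-fold ratio as geneA but is dropped, because both of its intensities fall below the cutoff.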
17Streamlined Affy Analysis
(Flow diagram. Labels: Raw data; Normalize; Filter
(Present/Absent, Minimum value, Fold change);
Significance (t-test, Rank Product);
Classification; Clustering; Machine learning;
Gene lists)
18Differential Expression
- Type I analysis
- Look for genes with vastly different expression
under different conditions
- How do you measure "vastly different"?
- What role should derived statistics play?
19Type I Differential Expression
20Multiple Test
- In a microarray experiment, each gene (each probe
or probe set) is really a separate experiment
- Yet if you treat each gene as an independent
comparison, you will almost always find some with
apparently significant differences by chance alone
- (the tails of a normal distribution)
21Multiple test
- Bonferroni correction
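- $\alpha_g / n$: the global significance level
divided by the number of tests
- Too strict
- Holm's stepwise correction
- Sort the p-values in ascending order,
$p_1 \le p_2 \le \dots \le p_n$
- If $p_1 < \alpha_g / n$, then test the remaining n-1
p-values by comparing the next p-value:
$p_2 < \alpha_g / (n-1)$, and so on.
- If m is the largest integer for which
$p_i < \alpha_g / (n-i+1)$ holds for every i up to m,
then we call genes 1, ..., m significantly
differentially expressed.
- Still too strict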
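Both corrections are short enough to sketch directly. The function names and the four example p-values are hypothetical; the default global level of 0.05 is an assumption, not something the slides specify.

```python
def bonferroni(p_values, alpha=0.05):
    """Return indices of tests with p < alpha / n."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / n]

def holm(p_values, alpha=0.05):
    """Return indices rejected by Holm's step-down procedure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    rejected = []
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        # The threshold loosens at each step: alpha/n, alpha/(n-1), ...
        if p_values[i] < alpha / (n - rank):
            rejected.append(i)
        else:
            break  # stop at the first p-value that fails its threshold
    return rejected

# Hypothetical p-values for four genes.
p_values = [0.04, 0.001, 0.3, 0.015]

print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
```

With these values Bonferroni keeps only the gene with p = 0.001, while Holm also keeps the gene with p = 0.015, because its threshold has relaxed from 0.05/4 to 0.05/3 by the second step.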
22False Discovery
- Statisticians call a false positive a "type 1
error" or a "False Discovery"
- False Discovery Rate (FDR) is used here to mean
the expected number of false positives: the
p-value cutoff of the t-test × the number of genes
in the array
- For a p-value of 0.01 × 10,000 genes = 100
falsely "different" genes
- You cannot eliminate false positives, but by
choosing a more stringent p-value, you can keep
them manageable (try p = 0.001)
- The FDR must be smaller than the number of real
differences that you find, which in turn depends
on the size of the differences and variability of
the measured expression values
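The arithmetic above can be written out as a small helper. The function names are hypothetical, and the count of 250 genes called "different" is an invented figure used only to show how the expected false positives compare with the size of a gene list.

```python
def expected_false_positives(p_cutoff, n_genes):
    """Expected number of genes passing the cutoff by chance alone."""
    return p_cutoff * n_genes

def false_fraction(p_cutoff, n_genes, n_called):
    """Rough fraction of the called genes that may be false positives."""
    return min(1.0, expected_false_positives(p_cutoff, n_genes) / n_called)

n_genes = 10_000
loose = expected_false_positives(0.01, n_genes)    # the slide's example
strict = expected_false_positives(0.001, n_genes)  # the more stringent cutoff

# Hypothetical experiment: 250 genes pass the p = 0.01 cutoff.
fraction = false_fraction(0.01, n_genes, n_called=250)
print(loose, strict, fraction)
```

If the expected false positives approach the number of genes actually called, the list is mostly noise, which is the point of the last bullet.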
23Type I, II error
24Higher Level Microarray Data Analysis
- Clustering and pattern detection
- Data mining and visualization
- Controls and normalization of results
- Statistical validation
- Linkage between gene expression data and gene
sequence/function/metabolic pathways databases
- Discovery of common sequences in co-regulated
genes
- Meta-studies using data from multiple experiments
25Clustering
- Identify co-regulated genes with microarray
experiments (assumption?)
- Identify genes with similar expression
- Grouping unknown genes with known genes may
provide insight into function of unknown genes
- Only useful for genes with varying expression
levels
26Types of Clustering
- Hierarchical
- Link similar genes, build up to a tree of all
- Self Organizing Maps (SOM)
- Split all genes into similar sub-groups
- Finds its own groups (machine learning)
- Principal Component Analysis (PCA)
- every gene is a dimension (vector), find a single
dimension that best represents the differences in
the data
27Clustering
- Pairwise similarity measure
- Minkowski distance
- if q = 1 → Manhattan distance
- if q = 2 → Euclidean distance
- Pearson's or Spearman's correlation coefficient
- Treatment of missing values
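The similarity measures listed above can be sketched with the standard library alone. The function names and example vectors are hypothetical; Spearman's coefficient (Pearson applied to ranks) and missing-value handling are not shown.

```python
from math import sqrt

def minkowski(x, y, q):
    """Minkowski distance: q = 1 is Manhattan, q = 2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression profiles across three conditions.
g1 = [1.0, 2.0, 3.0]
g2 = [4.0, 6.0, 3.0]

manhattan = minkowski(g1, g2, q=1)
euclidean = minkowski(g1, g2, q=2)
same_shape = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(manhattan, euclidean, same_shape, opposite)
```

Distances respond to absolute expression level, whereas correlation responds only to the shape of the profile, which is why the choice of measure changes which genes cluster together.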
28Clustering
- Data transformation
- Useful before computing pairwise similarity
- Ex) x1 = (100, 200, 300), x2 = (10, 20, 30),
x3 = (30, 20, 10)
- Divide each component xj of the p-dimensional data
vector by its Euclidean norm
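Applying this transformation to the three example vectors shows why it helps. This is a minimal sketch; the helper names are hypothetical.

```python
from math import sqrt

def scale_to_unit_norm(x):
    """Divide each component of x by the Euclidean norm of x."""
    norm = sqrt(sum(v * v for v in x))
    if norm == 0:
        raise ValueError("cannot scale the zero vector")
    return [v / norm for v in x]

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x1 = [100, 200, 300]
x2 = [10, 20, 30]
x3 = [30, 20, 10]

u1, u2, u3 = (scale_to_unit_norm(x) for x in (x1, x2, x3))

# Raw data: x2 looks closer to x3 (similar magnitude) than to x1 (same trend).
print(euclidean(x1, x2), euclidean(x2, x3))
# Scaled data: x1 and x2 coincide, while x3 (opposite trend) stays apart.
print(euclidean(u1, u2), euclidean(u2, u3))
```

After scaling, the distance reflects the pattern of expression rather than its overall magnitude.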
29Hierarchical Clustering
30Hierarchical Clustering
- Requires
- Dissimilarity measure between pairs of clusters
- Update procedure for recalculation of the merged
cluster
- Weakness
- Does not repair false joining of data points from
a previous step
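The two requirements can be seen in a bare-bones agglomerative sketch. Single linkage is used here as the dissimilarity between clusters purely as an assumption (average or complete linkage are common alternatives), and the function names and points are hypothetical.

```python
from math import dist  # Python 3.8+

def single_linkage(cluster_a, cluster_b, points):
    """Dissimilarity between two clusters: distance of their closest members."""
    return min(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of current clusters with the smallest dissimilarity
        a, b = min(
            ((a, b) for idx, a in enumerate(clusters) for b in clusters[idx + 1:]),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        clusters.remove(a)
        clusters.remove(b)
        # Update step: the merged cluster replaces its parts and is never split again
        clusters.append(a + b)
        merges.append((a, b))
    return merges

# Hypothetical two-dimensional expression profiles for four genes.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerate(points)
for step in merges:
    print(step)
```

The merge history is the tree. Because merged clusters are never revisited, an early wrong join stays in the tree, which is the weakness noted above.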
31K-means clustering
32Self Organizing Maps
33Classification vs. Clustering
- Purpose
- Clustering: to partition genes into
co-expression groups by a suitable optimization
method
- Classification: to assign a given condition to
preexisting classes of conditions
34Classification
- How to sort samples into two classes based on
gene expression data
- Cancer vs. normal
- Cancer sub-types (benign vs. malignant)
- Responds well to drug vs. poor response (e.g.
tamoxifen for breast cancer)
35Support Vector Machine (SVM)
- Main idea: select the hyperplane that is most
likely to generalize to a future datum
36Cross-validation
- Holdout validation
- K-fold cross-validation
- Leave-one-out cross-validation
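The three schemes differ only in how samples are split into training and test sets. A minimal index-splitting sketch follows. The function name is hypothetical, and the samples are assumed to be already shuffled; in practice they would usually be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# K-fold: every sample is tested exactly once.
splits = list(k_fold_splits(6, 3))

# Leave-one-out is the special case k == number of samples.
loo = list(k_fold_splits(6, 6))

# Holdout validation is a single train/test split, e.g. the first fold only.
holdout_train, holdout_test = splits[0]

for train, test in splits:
    print("train", train, "test", test)
```

A classifier would be fitted on each `train` set and scored on the matching `test` set, with the scores averaged across folds.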
37Reverse Engineering Genetic Networks
- Reconstruction of the interactions in a
qualitative way from experimental data
- Once determined, these networks can be used to
predict gene expression of corresponding genes
- Can we reconstruct the qualitative interactions
of corresponding genes? No.
- Time-dependent measurement
- Knockout experiments
38Gene Regulatory Networks
39Network motif
- Problem: dimensionality of gene regulatory
network
- Break down this network into small components
called network motifs and connect to Ensemble
network