Title: Introduction to Microarrays
1. Introduction to Microarrays
- Kellie J. Archer, Ph.D.
- Assistant Professor
- Department of Biostatistics
- kjarcher@vcu.edu
2. Microarrays
A snapshot that captures the activity pattern of
thousands of genes at once.
Affymetrix GeneChip
Custom spotted arrays
3. Spotted Microarray Process
(Diagram: CTRL and TEST samples)
4. Affymetrix GeneChip Probe Arrays
(Figure: GeneChip Probe Array, 1.28 cm; hybridized probe cell, 24 µm; single-stranded, fluorescently labeled DNA target; oligonucleotide probe)
Each probe cell or feature contains millions of copies of a specific oligonucleotide probe.
Over 250,000 different probes complementary to genetic information of interest.
(Image of Hybridized Probe Array)
5. Applications of Microarrays
- Cancer research: molecular characterization of tumors on a genomic scale, toward more reliable diagnosis and effective treatment of cancer
- Immunology: study of host genomic responses to bacterial infections
- Model organisms: multifactorial experiments monitoring expression response to different treatments and doses, over time, or in different cell types
- etc.
6. Applications of Microarrays
- Compare mRNA transcript levels in different types of cells, i.e., vary:
  - Tissue (liver vs. brain)
  - Treatment (Drugs A, B, and C)
  - State (tumor vs. normal)
  - Organism (yeast, different strains)
  - Timepoint
  - etc.
8. Affymetrix Design
11 to 20 probe pairs interrogate each gene.
PM: GCGCCGGCTGCAGGAGCAGGAGGAG
MM: GCGCCGGCTGCACGAGCAGGAGGAG
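The MM sequence above differs from the PM only at the middle (13th) base, which is replaced by its complement. A minimal Python sketch of that construction, checked against the pair shown on the slide (the function name is mine, not Affymetrix's):

```python
# A minimal sketch: construct the mismatch (MM) probe from a perfect-match
# (PM) 25-mer by complementing the middle (13th) base, and verify against
# the sequences shown on the slide.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm: str) -> str:
    """Return the MM probe: the PM 25-mer with its central base complemented."""
    mid = len(pm) // 2                       # index 12, i.e. the 13th base
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

pm = "GCGCCGGCTGCAGGAGCAGGAGGAG"
mm = "GCGCCGGCTGCACGAGCAGGAGGAG"
assert mismatch_probe(pm) == mm              # matches the pair on the slide
print(mismatch_probe(pm))
```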
9. Image Analysis: Pixel-Level Data
A 6 x 6 matrix of pixels for each PM and MM probe (HG-U133A GeneChip).
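The slide does not state how the 36 pixels are reduced to one probe-cell intensity; a common choice in Affymetrix image analysis is an upper percentile (typically the 75th) of the pixel intensities. A minimal sketch under that assumption, with simulated pixels:

```python
# A minimal sketch, assuming the probe-cell intensity is summarized by the
# 75th percentile of its pixel intensities; the exact rule (and any
# border-pixel trimming) is not specified on the slide.
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.normal(loc=500.0, scale=50.0, size=(6, 6))   # simulated 6 x 6 pixel block

cell_intensity = np.percentile(pixels, 75)   # one intensity per probe cell
print(round(cell_intensity, 1))
```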
10. Expression Quantification
PM and MM intensities are combined to form an expression measure for the probe set (gene).
PM: GCGCCGGCTGCAGGAGCAGGAGGAG
MM: GCGCCGGCTGCACGAGCAGGAGGAG
11. Expression Quantification
- Initially, the Affymetrix signal was calculated as
  Signal(A) = (1/n_A) * sum over j in A of (PM_j - MM_j),
  where j indexes the n_A probe pairs for each probe set A. This is known as the Average Difference method (a small sketch follows below).
- Problems:
  - Large variability in PM - MM
  - MM probes may be measuring signal for another gene/EST
  - PM - MM calculations are sometimes negative
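A minimal sketch of the Average Difference calculation just described, using made-up PM/MM intensities for an 11-probe-pair probe set; note that individual PM - MM differences can come out negative, as the slide warns:

```python
# A minimal sketch of the Average Difference expression measure: the mean of
# PM - MM over the probe pairs in a probe set. Intensity values are made up.
import numpy as np

pm = np.array([1200.0, 950.0, 1430.0, 880.0, 1010.0, 1600.0,
               760.0, 1120.0, 990.0, 1340.0, 870.0])   # 11 PM intensities
mm = np.array([400.0, 980.0, 610.0, 300.0, 450.0, 700.0,
               820.0, 390.0, 510.0, 640.0, 330.0])     # 11 MM intensities

diffs = pm - mm              # note: two of these differences are negative
avg_diff = diffs.mean()      # the Average Difference signal for this probe set
print(avg_diff)
```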
12. Expression Quantification
- The mean μ of a random variable X is a measure of central location of the density of X.
- The variance of a random variable X is a measure of spread or dispersion of the density of X: Var(X) = E[(X - μ)^2] = E(X^2) - μ^2.
- Standard deviation: σ = √Var(X).
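A small numeric check of the shortcut formula Var(X) = E(X^2) - μ^2, using a toy sample of PM - MM differences (values made up; population formulas, dividing by n):

```python
# Verify Var(X) = E[(X - mu)^2] = E(X^2) - mu^2 on a toy sample of made-up
# PM - MM differences, then take the square root for the standard deviation.
import numpy as np

x = np.array([800.0, -30.0, 820.0, 580.0, 560.0])   # made-up PM - MM differences
mu = x.mean()

var_centered = ((x - mu) ** 2).mean()    # E[(X - mu)^2]
var_shortcut = (x ** 2).mean() - mu**2   # E(X^2) - mu^2
sigma = np.sqrt(var_centered)            # standard deviation

assert np.isclose(var_centered, var_shortcut)
print(mu, var_centered, sigma)
```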
13. Expression Quantification Illustration: Average Difference.xls
15. Sources of Obscuring Variation in Microarray Measurements
- Sample handling (degree of physical manipulation, time from extirpation to freezing)
- Microarray manufacture
- Sample processing (extraction procedure, RNA integrity/purity, RNA labeling)
- Processing differences (hybridization chambers, washing modules, scanners)
- Personnel differences
- Random differences in signal intensity in a data set which covary with the biological process
16. Normalization
- The purpose of normalization is to remove experimental artifacts of no direct interest, that is, the removal of systematic effects other than differential expression. Normalization procedures often include:
  - background subtraction,
  - detection of outliers,
  - and removal of variation due to:
    - differences in sample preparation,
    - array differences,
    - differences in dye labeling efficiencies,
    - and scanning differences.
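The slides do not name a specific normalization algorithm; one widely used option for removing array-to-array intensity differences is quantile normalization, sketched below on simulated data (an illustration, not necessarily the method behind the figures on the next slides):

```python
# A minimal sketch of quantile normalization: force every array to have the
# same intensity distribution, removing array-level differences of no direct
# interest. Illustrative only; the lecture's normalization method is not stated.
import numpy as np

def quantile_normalize(x: np.ndarray) -> np.ndarray:
    """x: genes (rows) by arrays (columns) matrix of intensities."""
    ranks = x.argsort(axis=0).argsort(axis=0)         # rank of each value per array
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)  # average distribution across arrays
    return mean_quantiles[ranks]                      # substitute each value by rank

rng = np.random.default_rng(1)
raw = rng.lognormal(mean=6.0, sigma=1.0, size=(1000, 4))  # fake 4-array data
raw[:, 2] *= 1.5                                          # array 3 is "brighter"

norm = quantile_normalize(raw)
print(raw.mean(axis=0).round(1))    # per-array means differ before normalization
print(norm.mean(axis=0).round(1))   # and are identical afterwards
```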
17. 16 Replicate HG-U133A GeneChips, Before Normalization
18. 16 Replicate HG-U133A GeneChips, After Normalization
20. Taxonomy of Microarray Data Analysis Methods
- Unsupervised learning: the statistical analysis seeks to find structure in the data without knowledge of class labels.
- Supervised learning: class or group labels are known a priori, and the goal of the statistical analysis pertains to identifying differentially expressed genes (a.k.a. feature selection) or identifying combinations of genes that are predictive of class or group membership.
21. Unsupervised Learning
- Unsupervised learning, or clustering, involves the aggregation of samples into groups based on the similarity of their respective expression patterns, without knowledge of class labels.
- Examples of unsupervised learning methods include:
  - Hierarchical clustering
  - k-means
  - k-medoids
  - Self-Organizing Maps
  - Principal Components
  - Multidimensional Scaling
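To illustrate the first method on the list, here is a minimal sketch of hierarchical clustering of samples by their expression profiles using scipy; the expression matrix is simulated, and the distance/linkage choices are just one reasonable option:

```python
# A minimal sketch of hierarchical clustering of samples by expression
# profile. In practice the input would be the normalized genes-by-samples
# matrix; here it is simulated with two built-in sample groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
expr = rng.normal(size=(500, 8))        # 500 genes x 8 samples (simulated)
expr[:50, :4] += 3.0                    # one block of genes up in samples 1-4
expr[50:100, 4:] += 3.0                 # a different block up in samples 5-8

dist = pdist(expr.T, metric="correlation")          # sample-to-sample distances
tree = linkage(dist, method="average")              # average-linkage dendrogram
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 groups
print(labels)                                       # samples 1-4 vs. samples 5-8
```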
22. Supervised Learning
- Example methods for class comparison / feature selection include:
  - t-test / Wilcoxon rank sum test
  - F-test / Kruskal-Wallis test
  - etc.
- Example methods for class prediction include:
  - Weighted voting
  - k-nearest neighbors
  - Compound covariate predictors
  - Classification trees
  - Support vector machines
  - etc.
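To illustrate the first class-comparison method listed, a minimal sketch of a per-gene two-sample t-test between tumor and normal arrays using scipy; the data are simulated, and a multiple-testing adjustment (not shown) would normally follow:

```python
# A minimal sketch of per-gene feature selection with a two-sample t-test.
# Data are simulated; in practice an FDR or other multiplicity correction
# would be applied to the resulting p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
tumor = rng.normal(size=(1000, 10))      # 1000 genes x 10 tumor samples
normal = rng.normal(size=(1000, 10))     # 1000 genes x 10 normal samples
tumor[:20] += 2.0                        # first 20 genes truly differential

t_stat, p_val = stats.ttest_ind(tumor, normal, axis=1)  # one test per gene
top = np.argsort(p_val)[:10]             # ten smallest p-values
print(top)                               # mostly indices below 20
```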
23. Supervised Learning: Class Prediction
- Risk of over-fitting the data: one may have a perfect discriminator for the data set at hand, but the same model may perform poorly on independent data sets.
- Most prediction methods are intended for large n (samples), small p (covariates) datasets.
- The process is to:
  - Fit the model
  - Check model adequacy
  - Make an inference
24. Class Prediction: Checking Model Adequacy
- Regardless of the algorithm used, it is essential that, once the prediction rule has been defined, an unbiased estimate of the true error rate be calculated.
25. Class Prediction: Checking Model Adequacy
- In a data-rich situation:
  - Randomly divide the dataset into two parts, representing a training and a test dataset.
  - Build the prediction algorithm using the training dataset.
  - Once a final model has been developed, apply the prediction rule to the test dataset to estimate the misclassification error.
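A minimal sketch of this training/test approach using scikit-learn; the simulated data and the k-nearest-neighbors classifier are stand-ins for whatever prediction rule is actually being built:

```python
# A minimal sketch of the data-rich approach: split into training and test
# sets, build the rule on the training data, and estimate the
# misclassification error on the held-out test data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 50))            # 100 samples x 50 gene features (simulated)
y = np.repeat([0, 1], 50)                 # two classes
X[y == 1, :5] += 1.5                      # first 5 genes separate the classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
error = 1.0 - clf.score(X_te, y_te)       # misclassification error on the test set
print(round(error, 2))
```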
26. Class Prediction: Checking Model Adequacy
- For small sample sizes, withholding a large portion of the data for validation purposes may limit the ability to develop a prediction rule. Therefore, use cross-validation techniques to assess the error.
27. Class Prediction: Checking Model Adequacy
- K-fold cross-validation requires one to randomly split the dataset into K equally sized groups.
- Thereafter, the model is fit to K-1 parts of the data, and the generalization error is calculated using the Kth remaining part of the data.
- This procedure is repeated so that the generalization error is estimated for each of the K parts of the data, providing an overall estimate of the generalization error and its associated standard error.
28. Class Prediction: Checking Model Adequacy
Groups: 1 2 3 4 5 6 7 8 9 10
- Leave out the data in group 3
- Fit the model to the data in groups 1-2 and 4-10 (the learning dataset)
- Calculate the error using the observations in group 3 as the test dataset
- Do this for each of the 10 partitions (see the sketch below)
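A minimal sketch of the 10-fold procedure walked through above, using scikit-learn's KFold; the classifier and the simulated data are illustrative stand-ins, and what matters is that each group serves exactly once as the test set:

```python
# A minimal sketch of 10-fold cross-validation: each group is left out once,
# the model is fit to the other nine, and the error on the left-out group is
# recorded, giving an overall error estimate and its standard error.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))            # 100 samples x 50 gene features (simulated)
y = np.repeat([0, 1], 50)
X[y == 1, :5] += 1.5                      # signal in the first 5 genes

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))

errors = np.array(errors)
print(errors.mean(),                              # cross-validated error estimate
      errors.std(ddof=1) / np.sqrt(len(errors)))  # and its standard error
```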