Title: BIOS816/VBMS818 Lecture 8
1. BIOS816/VBMS818 Lecture 8: Microarray Analysis
- Guoqing Lu
- Office: E115 Beadle Center
- Tel: (402) 472-4982
- Email: glu3_at_unl.edu
- Website: http://biocore.unl.edu
2. Introduction to DNA Microarray
- Microarrays revolutionized biology and medical research
  - One gene at a time before, now tens of thousands simultaneously
- Gene expression
  - Measure the expression levels of many thousands of genes in only a few biological samples
  - E.g., a sample from a specific organ to show which genes are expressed
  - E.g., compare samples from healthy and sick hosts to find gene-disease connections
  - E.g., probes are sets of human pathogens for disease detection
3. Introduction to DNA Microarray (cont'd)
- Replicates are needed
  - Technical replicates, i.e. measuring gene expression with the same starting material on independent arrays
  - Biological replicates, e.g. measuring gene expression from multiple cell lines
- The challenge to the biologist is to apply appropriate statistical techniques to determine which changes are relevant
Affymetrix U133 Plus 2.0: 47,000 x 11 x 2
4. Microarray Experiment
- http://www.bio.davidson.edu/courses/genomics/chip/chip.html (Flash animation)
5. Microarray Technology
- The basic principle is the same for all microarray platforms
  - DNA complementary to genes of interest is generated and laid out in microscopic quantities on solid surfaces at defined positions
  - DNA from samples is washed over the surface; complementary DNA binds
  - The presence of bound DNA is detected by fluorescence following laser excitation
6. Two Different Techniques
- Spotted cDNA
  - DNA sequences are laid down through spotting
  - Complete sequences are laid down
  - Cheaper
  - Usually measures relative expression in two samples
  - E.g., Synteni/Stanford
- Oligonucleotide arrays
  - DNA sequences are laid down through photolithography
  - A series of fragments is laid down
  - Probably give higher quality results
  - Usually measures expression in a single sample
  - Mainly supplied by Affymetrix Inc.
7. Spotted cDNA
- Uses available cDNA libraries to create the array
- Quality depends on choice of cDNAs
- Cross hybridization and non-specific binding can be a problem
- Reduce cross hybridization by choosing highly gene-specific DNA for the spots
8. Oligonucleotide arrays
- Each gene is represented by 11-20 paired oligos
  - Each represents a different part of the gene
  - Oligos are produced in situ on the chip
- Each pair comprises two 25-mers
  - Perfect match (PM)
  - Mismatch (MM)
Affymetrix uses a unique combination of photolithography and combinatorial chemistry to manufacture GeneChip Arrays.
http://www.affymetrix.com/technology/manufacturing/index.affx
9. PM and MM oligos
PM: ATCTGCGTGTCGTAGTGTGACCCCA
MM: ATCTGCGTGTCGAAGTGTGACCCCA
- By measuring the difference in hybridization to the PM and MM oligos, the effect of non-specific and cross hybridization is minimised
- Using 11-20 pairs from different parts of the gene, these effects are further reduced
https://www.affymetrix.com/support/downloads/manuals/data_analysis_fundamentals_manual.pdf
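As a minimal sketch of the PM/MM idea, the Python below locates the mismatched base in the oligo pair shown on this slide and then averages PM minus MM over a probe set. The function names and the intensity values are hypothetical and made up for illustration; a plain average difference is only one way to summarize a probe set and is not necessarily what the Affymetrix software computes.

```python
PM = "ATCTGCGTGTCGTAGTGTGACCCCA"
MM = "ATCTGCGTGTCGAAGTGTGACCCCA"


def mismatch_positions(pm, mm):
    """Return the 1-based positions at which two equal-length oligos differ."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM oligos must have the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(pm, mm)) if a != b]


def average_difference(pm_signals, mm_signals):
    """Mean of (PM - MM) over the probe pairs of one gene."""
    if len(pm_signals) != len(mm_signals) or not pm_signals:
        raise ValueError("need one MM value for every PM value")
    diffs = [p - m for p, m in zip(pm_signals, mm_signals)]
    return sum(diffs) / len(diffs)


# The two 25-mers differ only at the central (13th) base.
print(len(PM), mismatch_positions(PM, MM))  # prints: 25 [13]

# Hypothetical intensities for 4 probe pairs (real probe sets use 11-20 pairs).
pm_signals = [1200.0, 950.0, 1430.0, 1100.0]
mm_signals = [300.0, 280.0, 410.0, 350.0]
print(average_difference(pm_signals, mm_signals))  # prints: 835.0
```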
10. Microarray Data Analysis
- Data preprocessing
  - Allows data sets from two (or more) samples to be compared to each other
- Inferential statistics
  - Hypothesis testing
  - Estimates the likelihood that particular genes are significantly regulated
- Descriptive (exploratory) statistics
  - Clustering and principal components analysis
  - Inspect the complex data set for biologically meaningful patterns
11. Microarray data analysis
- Begin with a data matrix (gene expression values versus samples)
- Typically, there are many genes (>> 10,000) and few samples (~10)
12. Preprocessing: normalization
- Normalization is needed
  - To compare signal intensities on two arrays
  - To compare two mRNA samples on the same array
- Make sure the samples are equivalent in some sense
13. Normalization methods
- Adjust spot values to show
  - The same total mRNA in all samples
  - Or, the same expression level for certain housekeeping genes
- Use spiked controls
  - Add equal amounts of a different mRNA to each sample and normalize to equalize intensity for these spots
Slide labels: cDNA Array, Oligo Array, Both Arrays
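The normalization strategies on this slide share one mechanism: pick a set of reference spots assumed to be constant across samples, then rescale each sample so those spots agree. A minimal sketch follows, assuming background-corrected intensities held in a dict; the function name, the gene names, the target value of 1000 and the numbers are all hypothetical.

```python
def scale_to_reference(intensities, reference_genes, target=1000.0):
    """Scale one sample so the mean intensity of the reference genes equals target.

    intensities: dict mapping gene name -> background-corrected intensity.
    reference_genes: housekeeping genes or spiked controls, assumed to be
    present at the same level in every sample. Passing every gene name
    instead corresponds to assuming equal total mRNA.
    """
    ref_values = [intensities[g] for g in reference_genes]
    ref_mean = sum(ref_values) / len(ref_values)
    if ref_mean <= 0:
        raise ValueError("reference genes must have a positive mean intensity")
    factor = target / ref_mean
    return {gene: value * factor for gene, value in intensities.items()}


# Hypothetical data: sample B was hybridized at half the overall brightness.
sample_a = {"spike1": 500.0, "spike2": 1500.0, "geneX": 800.0}
sample_b = {"spike1": 250.0, "spike2": 750.0, "geneX": 800.0}
spikes = ["spike1", "spike2"]

norm_a = scale_to_reference(sample_a, spikes)
norm_b = scale_to_reference(sample_b, spikes)
print(norm_a["geneX"], norm_b["geneX"])
```

After scaling, the spiked controls match between the two samples, so the raw tie in geneX turns into a two-fold difference.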
14. Data analysis: global normalization
- Global normalization procedure
  - Step 1: subtract background intensity values (use a blank region of the array)
  - Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
Do Exercise!
Affymetrix scaling!!!
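The two steps can be sketched in Python as follows. This is an illustration under assumptions, not the procedure of any particular package: the function names are hypothetical, the "average" is taken to be the arithmetic mean of the ratios, the small positive floor applied after background subtraction is a convenience to avoid division by zero, and all numbers are made up.

```python
def background_subtract(signals, background, floor=1.0):
    """Step 1: subtract the background estimate, clamping at a small floor."""
    return [max(s - background, floor) for s in signals]


def globally_normalize_ratios(channel_a, channel_b):
    """Step 2: rescale the ratios a/b so that their arithmetic mean equals 1."""
    ratios = [a / b for a, b in zip(channel_a, channel_b)]
    mean_ratio = sum(ratios) / len(ratios)
    return [r / mean_ratio for r in ratios]


# Hypothetical raw intensities for three spots in two channels (or two arrays).
raw_a = [220.0, 420.0, 1620.0]
raw_b = [120.0, 220.0, 420.0]
background = 20.0  # e.g. measured in a blank region of the array

a = background_subtract(raw_a, background)
b = background_subtract(raw_b, background)
normalized = globally_normalize_ratios(a, b)
print(normalized)
```

Before normalization the ratios are 2, 2 and 4; afterwards they average to 1, so only the third spot stands out as relatively higher in the first channel.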
15. Scatter plots
- Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
- Each dot corresponds to a gene expression value
- Most dots fall along a line
- Outliers represent up-regulated or down-regulated genes
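One simple way to turn "outliers from the line" into a rule is to measure each gene's log2 ratio between the two experiments and flag those beyond a fold-change cutoff. The sketch below assumes normalized, strictly positive expression values; the function name, the two-fold cutoff and the data are assumptions for illustration only.

```python
import math


def flag_outliers(control, experimental, min_fold=2.0):
    """Label each gene by how far it lies from the diagonal of the scatter plot."""
    threshold = math.log2(min_fold)
    labels = {}
    for gene in control:
        log_ratio = math.log2(experimental[gene] / control[gene])
        if log_ratio >= threshold:
            labels[gene] = "up"
        elif log_ratio <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical normalized expression values.
control = {"g1": 100.0, "g2": 200.0, "g3": 400.0}
experimental = {"g1": 110.0, "g2": 800.0, "g3": 100.0}
labels = flag_outliers(control, experimental)
print(labels)  # prints: {'g1': 'unchanged', 'g2': 'up', 'g3': 'down'}
```

A fixed fold-change cutoff ignores measurement variability, which is why the slides go on to inferential statistics.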
16. Inferential statistics
- Inferential statistics are used to make inferences about a population from a sample.
- Hypothesis testing is a common form of inferential statistics.
- A null hypothesis is stated, such as "There is no difference in signal intensity for the gene expression measurements in normal and diseased samples."
- The alternative hypothesis is that there is a difference.
- We use a test statistic to decide whether to reject the null hypothesis. For many applications, we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05.
17. Inferential statistics
Paradigm                       Parametric test    Nonparametric test
Compare two unpaired groups    Unpaired t-test    Mann-Whitney test
Compare two paired groups      Paired t-test      Wilcoxon test
Compare 3 or more groups       ANOVA
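To make the two t-tests in the table concrete, here is a minimal sketch of their test statistics using only the Python standard library. The unpaired version is Student's form with a pooled variance (it assumes equal variances in the two groups); the function names and the data are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def unpaired_t(group1, group2):
    """Student's t statistic for two independent groups (pooled variance)."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def paired_t(before, after):
    """Paired t statistic: a one-sample t-test on the per-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical expression values for one gene.
normal = [1.0, 2.0, 3.0, 4.0, 5.0]
diseased = [3.0, 4.0, 5.0, 6.0, 7.0]
t_unpaired = unpaired_t(normal, diseased)  # n1 + n2 - 2 = 8 degrees of freedom

before = [1.0, 2.0, 3.0, 4.0]
after = [2.0, 4.0, 5.0, 8.0]
t_paired = paired_t(before, after)  # n - 1 = 3 degrees of freedom
print(t_unpaired, t_paired)
```

Converting a statistic into a p-value requires the t distribution, which the standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind`, `ttest_rel`, `mannwhitneyu`, `wilcoxon` and `f_oneway` cover the entries in the table.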
18. Significance analysis of microarrays (SAM)
- SAM
  - An Excel plug-in
  - Modified t-test
  - Adjustable false discovery rate
http://www-stat.stanford.edu/~tibs/SAM/
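The slide does not spell out the modification. As commonly described for SAM, a small positive constant s0 is added to the denominator of the t statistic so that genes with tiny variances do not produce inflated scores. The sketch below illustrates that idea only: the function name is hypothetical, s0 is passed in by hand, and the data-driven choice of s0 and the permutation-based false discovery rate used by the real tool are not shown.

```python
import math
from statistics import mean, variance


def sam_like_d(group1, group2, s0):
    """t-like score with a constant s0 added to the standard error."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    s = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / (s + s0)


# Hypothetical gene with a small difference and a very small variance.
g1 = [10.0, 10.1, 9.9]
g2 = [10.2, 10.3, 10.1]

plain_t = sam_like_d(g1, g2, s0=0.0)   # reduces to the ordinary t statistic
damped_d = sam_like_d(g1, g2, s0=0.5)  # the same gene with a hand-picked s0
print(plain_t, damped_d)
```

With s0 = 0 the score is the usual pooled-variance t statistic; with a positive s0 its magnitude shrinks, most strongly for low-variance genes.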
19. SAM
Figure: SAM plot of observed versus expected values, with the up-regulated and down-regulated genes marked.
20. Descriptive statistics
- Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples.
- Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
- A first step is to arrange the data in a matrix.
- Next, use a distance metric to define the relatedness of the different data points.
- Two commonly used distance metrics are
  - Euclidean distance
  - Pearson coefficient of correlation
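The two metrics can behave very differently on the same pair of genes, which a short sketch makes visible. The function names and the expression profiles below are hypothetical; the Pearson coefficient is a similarity, so it is often converted to a distance as 1 - r.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical profiles across four samples: same shape, ten-fold different scale.
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [10.0, 20.0, 30.0, 40.0]

d = euclidean_distance(gene1, gene2)
r = pearson_correlation(gene1, gene2)
print(d, r, 1 - r)
```

The two genes are far apart in Euclidean terms yet almost perfectly correlated, so the choice of metric decides whether they end up in the same cluster.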
21. Descriptive statistics: clustering
- Clustering algorithms offer useful visual descriptions of microarray data.
- Genes may be clustered, or samples, or both.
- Clustering may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first).
- In each case, we end up with a tree having branches and nodes.
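22. Figure: agglomerative versus divisive clustering of five objects (a, b, c, d, e), drawn over steps 0 to 4. Read in the agglomerative direction, the single objects are joined into a,b and d,e, then c,d,e, and finally a,b,c,d,e; the divisive direction runs through the same steps in reverse.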
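A minimal sketch of the agglomerative direction, using five objects labelled a to e as in the figure. The one-dimensional coordinates are made up, and single linkage (the distance between the closest members of two clusters) is an assumption, since the slides do not name a linkage rule. `math.dist` requires Python 3.8 or later.

```python
import math


def agglomerative(points):
    """Single-linkage agglomerative clustering; returns the merge history.

    points: dict mapping a label to a tuple of coordinates.
    """
    clusters = [frozenset([label]) for label in sorted(points)]
    history = []

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(math.dist(points[p], points[q]) for p in c1 for q in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append(sorted(merged))
    return history


# Hypothetical positions chosen so that a,b and d,e are the closest pairs.
points = {"a": (0.0,), "b": (0.5,), "c": (5.0,), "d": (8.0,), "e": (9.0,)}
history = agglomerative(points)
for step, merged in enumerate(history, start=1):
    print(step, merged)
```

With these coordinates the merges come out as a,b, then d,e, then c,d,e, then all five objects. Reading the history backwards gives the order in which a divisive method would split them.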
23. Cluster and TreeView
- Perform a variety of types of cluster analysis and other types of processing on large microarray datasets
  - Clustering
  - K-means
  - SOM
  - PCA
http://rana.lbl.gov/EisenSoftware.htm
24. Cluster and TreeView
http://rana.lbl.gov/manuals/ClusterTreeView.pdf
25. Cluster and TreeView
26. Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)
27. K-means clustering
- Clusters the expression profiles into K clusters
  - You have to specify K
- Produces clusters that are as tight as possible
  - Each cluster has a centroid or mean expression profile
  - Tightness of clusters is measured by the sum of squared distances between each gene and the centroid of its cluster
  - The algorithm tries to minimise this sum
28. Self-organizing maps (SOM)
- Unlike k-means clustering, which is unstructured, SOMs allow one to impose partial structure on the clusters.
- The principle of SOMs
  - One chooses an initial geometry of nodes, such as a 3 x 2 rectangular grid (indicated by solid lines in the figure connecting the nodes).
  - Hypothetical trajectories of nodes as they migrate to fit data during successive iterations of the SOM algorithm are shown.
  - Data points are represented by black dots, six nodes of the SOM by large circles, and trajectories by arrows.
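The migration of nodes described above can be sketched as follows, using a grid of six nodes. This is an illustration under assumptions rather than the algorithm of any specific package: the function names, the linearly decaying learning rate and neighborhood radius, the Gaussian neighborhood and the data are all choices made here for brevity. `math.dist` requires Python 3.8 or later.

```python
import math
import random


def train_som(data, rows=2, cols=3, iterations=500, seed=0):
    """Fit a rows x cols grid of nodes to the data points."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Start each node at a randomly chosen data point.
    nodes = {g: list(rng.choice(data)) for g in grid}
    for t in range(iterations):
        frac = t / iterations
        rate = 0.5 * (1.0 - frac)            # learning rate shrinks over time
        radius = 1.5 * (1.0 - frac) + 0.01   # so does the neighborhood on the grid
        point = rng.choice(data)
        # The best-matching node is the one closest to the point in data space.
        best = min(grid, key=lambda g: math.dist(nodes[g], point))
        for g in grid:
            # Nodes near the winner ON THE GRID are pulled along with it.
            grid_dist = math.dist(g, best)
            influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            for d in range(dim):
                nodes[g][d] += rate * influence * (point[d] - nodes[g][d])
    return nodes


def assign(data, nodes):
    """Map each data point to the grid position of its nearest node."""
    return [min(nodes, key=lambda g: math.dist(nodes[g], p)) for p in data]


# Hypothetical two-dimensional data with values between 0 and 10.
data = [(1.0, 1.0), (1.5, 0.5), (5.0, 5.0), (5.5, 4.5), (9.0, 9.0), (9.5, 8.5)]
nodes = train_som(data)
print(nodes)
print(assign(data, nodes))
```

The grid neighborhood is what gives the partial structure mentioned on the slide: nodes adjacent on the grid are dragged toward similar regions of the data, which plain k-means does not enforce.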
29. Microarray Software
30. Exercise
- http://pevsnerlab.kennedykrieger.org/hinxton.html
- Thanks to Dr. Jonathan Pevsner