Title: Analyzing Expression Data: Clustering and Stats
1. Analyzing Expression Data: Clustering and Stats
2. Goals
- We've measured the expression of genes or proteins using the technologies discussed previously.
- What can we do with that information?
- Identify significant differences in expression
- Identify similar patterns of expression (clustering)
3. Analysis Steps
- Data normalization
- Statistical Analysis
- Cluster Analysis
4. I. Data Normalization
- Why normalize?
- Removes systematic errors
- Makes the data easier to analyze statistically
5. Sources of Error
- Measurements always contain errors:
- Systematic (oops)
- Random (noise!)
- Subtracting the background level can remove some systematic error.
- Using the ratio in two-channel experiments does this.
- Subtracting the overall average intensity can be used with one-channel data.
- Taking averages over replicates of the experiment reduces the random error (see the sketch after this list).
- Advanced error models are mentioned on p. 628 and covered in Further Reading.
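A minimal sketch of these corrections in Python, assuming made-up NumPy arrays for spot intensities and local background (all names and values below are hypothetical):

```python
import numpy as np

# Hypothetical two-channel spot intensities: replicates along axis 0, spots along axis 1.
red = np.array([[520.0, 1800.0, 95.0],
                [560.0, 1750.0, 110.0]])
green = np.array([[250.0, 1900.0, 400.0],
                  [240.0, 2050.0, 380.0]])
red_bg = np.full_like(red, 40.0)      # local background estimates
green_bg = np.full_like(green, 35.0)

# Subtracting the background removes part of the systematic error
# (clipped at 1 so the ratio and its log stay defined).
red_corr = np.clip(red - red_bg, 1.0, None)
green_corr = np.clip(green - green_bg, 1.0, None)

# The two-channel ratio cancels spot-level effects shared by both channels.
ratio = red_corr / green_corr

# Averaging over replicates reduces the random (noise) component.
mean_ratio = ratio.mean(axis=0)
print(mean_ratio)
```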
6. Expression Data Is Usually Not Gaussian (Normal)
- Many statistical tests assume that the data is normally distributed.
- Expression microarray spot intensity data (for example) is not.
- Intensity ratio data (two-channel) is not normal either.
- Both range from 0 to infinity, whereas normal data is symmetrical.
7. Taking the Logarithm Helps Normalize Expression Ratio Data
- The expression ratio is plotted versus the expression level (the geometric mean of the two channels).
- Plotting the log ratio vs. the log expression level gives data that is centered around y = 0 and fairly normal looking (see the sketch below).
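As an illustration, here is a short sketch that computes the log ratio and the log expression level (the geometric mean of the two channels) for a few hypothetical background-corrected intensities:

```python
import numpy as np

# Hypothetical background-corrected intensities for three spots.
red_corr = np.array([480.0, 1760.0, 60.0])
green_corr = np.array([215.0, 1865.0, 365.0])

log_ratio = np.log2(red_corr / green_corr)          # centered near 0 for unchanged genes
log_level = 0.5 * np.log2(red_corr * green_corr)    # log of the geometric mean of both channels
print(log_ratio, log_level)
```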
8. Taking the Log of the Expression Ratio Fixes the Left Tail
9. LOWESS Normalization
- Sometimes there is still a bias that depends on the expression level.
- This can be removed by a type of regression called Locally Weighted Scatterplot Smoothing (LOWESS).
- It computes and subtracts the mean of the log ratio locally, for various values of the expression level (the geometric mean of R and G), as in the sketch below.
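A sketch of LOWESS normalization using statsmodels; the simulated bias and the frac smoothing span are assumptions, not values from the text:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical log ratios with an intensity-dependent bias plus noise.
rng = np.random.default_rng(0)
log_level = rng.uniform(4.0, 14.0, size=500)                      # log expression level
log_ratio = 0.2 * np.sin(log_level) + rng.normal(0.0, 0.3, 500)   # bias + noise

# Fit the local mean of the log ratio as a function of expression level
# and subtract it, which removes the intensity-dependent bias.
trend = lowess(log_ratio, log_level, frac=0.4, return_sorted=False)
log_ratio_norm = log_ratio - trend
print(log_ratio_norm.mean())
```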
10. II. Statistical Analysis
- Determining what differences in expression are statistically significant
- Controlling false positives
11. When Are Two Measurements Significantly Different?
- We want to say that an expression ratio is significant if it is big enough (> 1) or small enough (< 1).
- A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small.
- The significance is related to the area of the overlap of the underlying distributions.
12. The Z-test
- If the data is approximately normal, convert it to a Z-score: Z = (X̄ − μ) / (σ / √n).
- X can be the log expression ratio; μ is then 0.
- σ is the sample standard deviation and n is the number of repeats.
- The Z-score is distributed N(0, 1) (standard normal).
- The significance level is the area in the tail(s) of the standard normal distribution (see the sketch after this list).
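A sketch of the Z-test as described above, using SciPy for the tail area (the replicate log ratios are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical replicate log2 expression ratios for one gene; mu = 0 under
# the null hypothesis of no change.
log_ratios = np.array([1.1, 0.9, 1.3, 1.0, 1.2])

n = log_ratios.size
z = (log_ratios.mean() - 0.0) / (log_ratios.std(ddof=1) / np.sqrt(n))

# Two-sided significance level: the area in both tails of N(0, 1).
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)
```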
13. The t-test
- The t-test makes fewer assumptions about the data than the Z-test.
- It can be applied to compare two average measurements which can have:
- Different variances
- Different numbers of observations
- You compute the t-statistic (see pages 654-655) and then look up the significance level of the Student's t distribution in a table (a sketch follows this list).
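The book's t-statistic formula (pages 654-655) is not reproduced here; as a sketch, SciPy's Welch t-test handles the unequal-variance, unequal-n case directly (the two samples below are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical log2 expression levels for one gene in two conditions,
# with different variances and different numbers of observations.
treated = np.array([8.9, 9.4, 9.1, 9.6, 9.2, 9.5])
control = np.array([8.1, 8.4, 8.0, 8.3])

# Welch's t-test: equal_var=False allows unequal variances and sample sizes.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
print(t_stat, p_value)
```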
14. III. Cluster Analysis
- Similar expression patterns: groups of genes/proteins with similar expression profiles
- Similar expression sub-patterns: groups of genes/proteins with similar expression profiles in a subset of conditions
- Different clustering methods
- Assessing the value of clusters
15. Example Gene Expression Profiles
- The expression level of a gene is measured at different time points after treating cells.
- Many different expression profiles are possible:
- No effect
- Immediate increase or decrease
- Delayed increase or decrease
- Transient increase or decrease
16. Clustering by Eye
- n genes or proteins
- m different samples (or conditions)
- Represent a gene as a point: X = <x1, x2, ..., xm>
- If m is 1 or 2 (or even 3) you can plot the points and look for clusters of genes with similar expression.
- But what if m is bigger than 3?
- We need to reduce the dimensionality: PCA.
17. Reducing the Dimensionality of Data: Principal Components Analysis
- PCA linearly maps each point to a small set of dimensions (components).
- The principal components are dimensions that capture the maximum variation in the data.
- The principal components capture most of the important information in the data (usually).
- Plotting each point's values in two of the principal component dimensions allows us to see clusters (see the sketch below).
(Figure: PCA applied to 2-D gel data)
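A minimal PCA sketch with scikit-learn, assuming a random n-by-m matrix stands in for real expression data (the matrix and the choice of two components are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical expression matrix: 100 genes measured in 8 conditions.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))

# Project each gene onto the first two principal components; plotting
# scores[:, 0] against scores[:, 1] is how clusters would be inspected.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print(scores.shape, pca.explained_variance_ratio_)
```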
18. PCA, an Illustration: Yeast Cell Cycle Gene Expression
- The singular value decomposition (SVD) of a matrix X is X = U Σ V^T.
- The mapped value of X is Y = X V (= U Σ).
- The rows of Y give the mapping of each gene: mapped gene i is Yi = <y1, y2, ..., ym>.
- (PNAS, 2000)
- See the sketch after this list.
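The same mapping written directly with NumPy's SVD (the expression matrix is again a random stand-in):

```python
import numpy as np

# Hypothetical expression matrix X: genes in rows, conditions in columns.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))

# X = U @ diag(s) @ Vt; the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Map each gene onto those directions: Y = X V, which equals U * s.
Y = X @ Vt.T
print(np.allclose(Y, U * s))  # True: the two forms agree
```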
19. Clustering Using Statistics
- An algorithm identifies the groups.
- Example: similar expression profiles
- A distance measure between pairs of points is needed.
20. Distance Measures Between Pairs of Points
- In order to cluster the points (genes or conditions), we need some concept of which points are close to each other.
- So we need a measure of distance (or, conversely, similarity) between two rows (or columns) in our n-by-m matrix.
- We can then compute all the pairwise distances between rows (or columns).
21. Standard Distance Measures
- Euclidean Distance
- Pearson Correlation Coefficient
- Mahalanobis Distance
22. Euclidean Distance
- Standard, everyday distance
- Treats all dimensions equally
- If some genes vary more than others (have higher variance), they influence the distance more.
23. Mahalanobis Distance
- The normalized Euclidean distance
- Scales each dimension by the variance in that dimension.
- This is useful if the genes tend to vary much more in one sample than in others, since it reduces the effect of that sample on the distances.
24. Pearson Correlation Coefficient
- Distances are small when two genes have similar patterns of change, even if the sizes of the changes are different.
- This is accomplished by scaling by the sample variance of each gene's expression levels under different conditions (the sketch after this list compares all three distances).
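A sketch comparing the three distances on a pair of hypothetical gene profiles; the variance-scaled form below (SciPy's standardized Euclidean distance) matches the per-dimension scaling described on the Mahalanobis slide, while the full Mahalanobis distance would use the inverse covariance matrix:

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical expression matrix (50 genes x 6 conditions) and two profiles from it.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))
g1, g2 = X[0], X[1]

# Euclidean: treats all conditions equally.
d_euclidean = distance.euclidean(g1, g2)

# Variance-scaled Euclidean: each condition scaled by its variance across genes.
d_scaled = distance.seuclidean(g1, g2, V=X.var(axis=0, ddof=1))

# Correlation distance (1 - Pearson r): small when the two profiles change
# in a similar pattern even if the magnitudes differ.
d_pearson = distance.correlation(g1, g2)

print(d_euclidean, d_scaled, d_pearson)
```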
25. Choice of Distance Matters
- Hierarchical clustering (dendrogram) of tissues.
- Corresponds to clustering the columns of the matrix.
- Branches are different (cancer B/C vs. A/B).
26. Clustering Algorithms
- Hierarchical Clustering
- K-means clustering
- Self-organizing maps and trees
27. Hierarchical Clustering
- Algorithms progressively merge clusters or split clusters.
- The merging criterion can be single-linkage or complete-linkage.
- They produce dendrograms.
- Dendrograms can be interpreted at different thresholds (see the sketch after this list).
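A short sketch of agglomerative clustering with SciPy (the expression matrix and the distance threshold are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical expression matrix: 30 genes x 6 conditions.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))

# Agglomerative clustering with complete linkage ('single' would give
# single-linkage merging); Euclidean distances between gene profiles.
Z = linkage(X, method="complete", metric="euclidean")

# Interpreting the dendrogram at a threshold: cut it at a chosen distance
# to obtain flat clusters; different thresholds give different clusterings.
labels = fcluster(Z, t=4.0, criterion="distance")
print(labels)
```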
28. Types of Linkage
- A. Single Linkage
- B. Complete Linkage
- C. Centroid Method
29. K-means Clustering
- Related to Expectation Maximization
- You specify the number of clusters.
- Iteratively moves the means of the clusters to maximize the likelihood (minimize total error); see the sketch below.
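A minimal k-means sketch with scikit-learn (the matrix and the choice of k = 4 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical expression matrix: 100 genes x 6 conditions.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))

# You choose the number of clusters; the algorithm alternates between
# assigning genes to the nearest cluster mean and recomputing the means,
# which decreases the total within-cluster squared error (inertia).
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:10], km.inertia_)
```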