Title: Analyzing Expression Data: Clustering and Stats
1. Analyzing Expression Data: Clustering and Stats
2. Goals
- We've measured the expression of genes or proteins using the technologies discussed previously.
- What can we do with that information?
- Identify significant differences in expression
- Identify similar patterns of expression (clustering)
3. Analysis Steps
- Data normalization
- Statistical Analysis
- Cluster Analysis
4. I. Data Normalization
- Why normalize?
- Removes systematic errors
- Makes the data easier to analyze statistically
5. Sources of Error
- Measurements always contain errors:
- Systematic (oops)
- Random (noise!)
- Subtracting the background level can remove some systematic error.
- Using the ratio in two-channel experiments does this.
- Subtracting the overall average intensity can be used with one-channel data.
- Taking averages over replicates of the experiment reduces the random error (see the sketch after this list).
- Advanced error models are mentioned on p. 628 and covered in Further Reading.
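A minimal sketch of these corrections in Python, assuming made-up NumPy arrays for spot intensities and local background (all names and values below are hypothetical):

```python
import numpy as np

# Hypothetical two-channel spot intensities: replicates along axis 0, spots along axis 1.
red = np.array([[520.0, 1800.0, 95.0],
                [560.0, 1750.0, 110.0]])
green = np.array([[250.0, 1900.0, 400.0],
                  [240.0, 2050.0, 380.0]])
red_bg = np.full_like(red, 40.0)      # local background estimates
green_bg = np.full_like(green, 35.0)

# Subtracting the background removes part of the systematic error
# (clipped at 1 so the ratio and its log stay defined).
red_corr = np.clip(red - red_bg, 1.0, None)
green_corr = np.clip(green - green_bg, 1.0, None)

# The two-channel ratio cancels spot-level effects shared by both channels.
ratio = red_corr / green_corr

# Averaging over replicates reduces the random (noise) component.
mean_ratio = ratio.mean(axis=0)
print(mean_ratio)
```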
6. Expression Data Is Usually Not Gaussian (Normal)
- Many statistical tests assume that the data is normally distributed.
- Expression microarray spot intensity data (for example) is not.
- Intensity ratio data (two-channel) is not normal either.
- Both range from 0 to infinity, whereas normal data is symmetrical.
7. Taking the Logarithm Helps Normalize Expression Ratio Data
- The expression ratio is plotted versus the expression level (the geometric mean of the two channels).
- Plotting the log ratio vs. the log expression level gives data that is centered around y = 0 and fairly normal looking (see the sketch below).
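As an illustration, here is a short sketch that computes the log ratio and the log expression level (the geometric mean of the two channels) for a few hypothetical background-corrected intensities:

```python
import numpy as np

# Hypothetical background-corrected intensities for three spots.
red_corr = np.array([480.0, 1760.0, 60.0])
green_corr = np.array([215.0, 1865.0, 365.0])

log_ratio = np.log2(red_corr / green_corr)          # centered near 0 for unchanged genes
log_level = 0.5 * np.log2(red_corr * green_corr)    # log of the geometric mean of both channels
print(log_ratio, log_level)
```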
8. Taking the Log of the Expression Ratio Fixes the Left Tail
9. LOWESS Normalization
- Sometimes there is still a bias that depends on the expression level.
- This can be removed by a type of regression called Locally Weighted Scatterplot Smoothing (LOWESS).
- It computes and subtracts the mean of the log ratio locally, for various values of the expression level (the geometric mean of R and G), as in the sketch below.
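A sketch of LOWESS normalization using statsmodels; the simulated bias and the frac smoothing span are assumptions, not values from the text:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical log ratios with an intensity-dependent bias plus noise.
rng = np.random.default_rng(0)
log_level = rng.uniform(4.0, 14.0, size=500)                      # log expression level
log_ratio = 0.2 * np.sin(log_level) + rng.normal(0.0, 0.3, 500)   # bias + noise

# Fit the local mean of the log ratio as a function of expression level
# and subtract it, which removes the intensity-dependent bias.
trend = lowess(log_ratio, log_level, frac=0.4, return_sorted=False)
log_ratio_norm = log_ratio - trend
print(log_ratio_norm.mean())
```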
10. II. Statistical Analysis
- Determining what differences in expression are statistically significant
- Controlling false positives
11. When Are Two Measurements Significantly Different?
- We want to say that an expression ratio is significant if it is big enough (> 1) or small enough (< 1).
- A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small.
- The significance is related to the area of the overlap of the underlying distributions.
12. The Z-test
- If the data is approximately normal, convert it to a Z-score: Z = (X̄ − μ) / (σ / √n).
- X can be the log expression ratio; μ is then 0.
- σ is the sample standard deviation and n is the number of repeats.
- The Z-score is distributed N(0, 1) (standard normal).
- The significance level is the area in the tail(s) of the standard normal distribution (see the sketch after this list).
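A sketch of the Z-test as described above, using SciPy for the tail area (the replicate log ratios are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical replicate log2 expression ratios for one gene; mu = 0 under
# the null hypothesis of no change.
log_ratios = np.array([1.1, 0.9, 1.3, 1.0, 1.2])

n = log_ratios.size
z = (log_ratios.mean() - 0.0) / (log_ratios.std(ddof=1) / np.sqrt(n))

# Two-sided significance level: the area in both tails of N(0, 1).
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)
```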
13. The t-test
- The t-test makes fewer assumptions about the data than the Z-test.
- It can be applied to compare two average measurements which can have:
- Different variances
- Different numbers of observations
- You compute the t-statistic (see pages 654-655) and then look up the significance level of the Student's t distribution in a table (a sketch follows this list).
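The book's t-statistic formula (pages 654-655) is not reproduced here; as a sketch, SciPy's Welch t-test handles the unequal-variance, unequal-n case directly (the two samples below are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical log2 expression levels for one gene in two conditions,
# with different variances and different numbers of observations.
treated = np.array([8.9, 9.4, 9.1, 9.6, 9.2, 9.5])
control = np.array([8.1, 8.4, 8.0, 8.3])

# Welch's t-test: equal_var=False allows unequal variances and sample sizes.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
print(t_stat, p_value)
```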
14. III. Cluster Analysis
- Similar expression patterns: groups of genes/proteins with similar expression profiles
- Similar expression sub-patterns: groups of genes/proteins with similar expression profiles in a subset of conditions
- Different clustering methods
- Assessing the value of clusters
15. Example Gene Expression Profiles
- The expression level of a gene is measured at different time points after treating cells.
- Many different expression profiles are possible:
- No effect
- Immediate increase or decrease
- Delayed increase or decrease
- Transient increase or decrease
16. Clustering by Eye
- n genes or proteins
- m different samples (or conditions)
- Represent a gene as a point: X = <x1, x2, ..., xm>
- If m is 1 or 2 (or even 3) you can plot the points and look for clusters of genes with similar expression.
- But what if m is bigger than 3?
- We need to reduce the dimensionality: PCA.
17. Reducing the Dimensionality of Data: Principal Components Analysis
- PCA linearly maps each point to a small set of dimensions (components).
- The principal components are dimensions that capture the maximum variation in the data.
- The principal components capture most of the important information in the data (usually).
- Plotting each point's values in two of the principal component dimensions allows us to see clusters (see the sketch below).
(Figure: PCA applied to 2-D gel data)
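A minimal PCA sketch with scikit-learn, assuming a random n-by-m matrix stands in for real expression data (the matrix and the choice of two components are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical expression matrix: 100 genes measured in 8 conditions.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))

# Project each gene onto the first two principal components; plotting
# scores[:, 0] against scores[:, 1] is how clusters would be inspected.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print(scores.shape, pca.explained_variance_ratio_)
```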
18. PCA, an Illustration: Yeast Cell Cycle Gene Expression
- The singular value decomposition (SVD) of a matrix X is X = U Σ V^T.
- The mapped value of X is Y = X V (= U Σ).
- The rows of Y give the mapping of each gene: mapped gene i is Yi = <y1, y2, ..., ym>.
- (PNAS, 2000)
- See the sketch after this list.
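The same mapping written directly with NumPy's SVD (the expression matrix is again a random stand-in):

```python
import numpy as np

# Hypothetical expression matrix X: genes in rows, conditions in columns.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))

# X = U @ diag(s) @ Vt; the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Map each gene onto those directions: Y = X V, which equals U * s.
Y = X @ Vt.T
print(np.allclose(Y, U * s))  # True: the two forms agree
```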
19. Clustering Using Statistics
- An algorithm identifies the groups.
- Example: similar expression profiles
- A distance measure between pairs of points is needed.
20. Distance Measures Between Pairs of Points
- In order to cluster the points (genes or conditions), we need some concept of which points are close to each other.
- So we need a measure of distance (or, conversely, similarity) between two rows (or columns) in our n-by-m matrix.
- We can then compute all the pairwise distances between rows (or columns).
21. Standard Distance Measures
- Euclidean Distance
- Pearson Correlation Coefficient
- Mahalanobis Distance
22. Euclidean Distance
- Standard, everyday distance
- Treats all dimensions equally
- If some genes vary more than others (have higher variance), they influence the distance more.
23. Mahalanobis Distance
- The normalized Euclidean distance
- Scales each dimension by the variance in that dimension.
- This is useful if the genes tend to vary much more in one sample than in others, since it reduces the effect of that sample on the distances.
24. Pearson Correlation Coefficient
- Distances are small when two genes have similar patterns of change, even if the sizes of the changes are different.
- This is accomplished by scaling by the sample variance of each gene's expression levels under different conditions (the sketch after this list compares all three distances).
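A sketch comparing the three distances on a pair of hypothetical gene profiles; the variance-scaled form below (SciPy's standardized Euclidean distance) matches the per-dimension scaling described on the Mahalanobis slide, while the full Mahalanobis distance would use the inverse covariance matrix:

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical expression matrix (50 genes x 6 conditions) and two profiles from it.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))
g1, g2 = X[0], X[1]

# Euclidean: treats all conditions equally.
d_euclidean = distance.euclidean(g1, g2)

# Variance-scaled Euclidean: each condition scaled by its variance across genes.
d_scaled = distance.seuclidean(g1, g2, V=X.var(axis=0, ddof=1))

# Correlation distance (1 - Pearson r): small when the two profiles change
# in a similar pattern even if the magnitudes differ.
d_pearson = distance.correlation(g1, g2)

print(d_euclidean, d_scaled, d_pearson)
```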
25. Choice of Distance Matters
- Hierarchical clustering (dendrogram) of tissues.
- Corresponds to clustering the columns of the matrix.
- Branches are different (cancer B/C vs. A/B).
26. Clustering Algorithms
- Hierarchical Clustering
- K-means clustering
- Self-organizing maps and trees
27. Hierarchical Clustering
- Algorithms progressively merge clusters or split clusters.
- The merging criterion can be single-linkage or complete-linkage.
- They produce dendrograms.
- Dendrograms can be interpreted at different thresholds (see the sketch after this list).
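A short sketch of agglomerative clustering with SciPy (the expression matrix and the distance threshold are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical expression matrix: 30 genes x 6 conditions.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))

# Agglomerative clustering with complete linkage ('single' would give
# single-linkage merging); Euclidean distances between gene profiles.
Z = linkage(X, method="complete", metric="euclidean")

# Interpreting the dendrogram at a threshold: cut it at a chosen distance
# to obtain flat clusters; different thresholds give different clusterings.
labels = fcluster(Z, t=4.0, criterion="distance")
print(labels)
```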
28. Types of Linkage
- A. Single Linkage
- B. Complete Linkage
- C. Centroid Method
29. K-means Clustering
- Related to Expectation Maximization
- You specify the number of clusters.
- Iteratively moves the means of the clusters to maximize the likelihood (minimize total error); see the sketch below.
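A minimal k-means sketch with scikit-learn (the matrix and the choice of k = 4 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical expression matrix: 100 genes x 6 conditions.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))

# You choose the number of clusters; the algorithm alternates between
# assigning genes to the nearest cluster mean and recomputing the means,
# which decreases the total within-cluster squared error (inertia).
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:10], km.inertia_)
```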