Analyzing Expression Data: Clustering and Stats

1
Analyzing Expression Data: Clustering and Stats
  • Chapter 16

2
Goals
  • We've measured the expression of genes or
    proteins using the technologies discussed
    previously.
  • What can we do with that information?
  • Identify significant differences in expression
  • Identify similar patterns of expression
    (clustering)

3
Analysis steps
  • Data normalization
  • Statistical Analysis
  • Cluster Analysis

4
I. Data Normalization
  • Why normalize?
  • Removes systematic errors
  • Makes the data easier to analyze statistically

5
Sources of Error
  • Measurements always contain errors.
  • Systematic (oops)
  • Random (noise!)
  • Subtracting the background level can remove some
    systematic error
  • Using the ratio in two-channel experiments does
    this
  • Subtracting the overall average intensity can be
    used with one-channel data.
  • Taking averages over replicates of the experiment
    reduces the random error.
  • Advanced error models are mentioned on p. 628 and
    covered in Further Reading.
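
As a rough illustration of the last two ideas on this slide, the sketch below simulates a spot reading as signal plus a constant background plus Gaussian noise, subtracts the background, and compares the spread of a single reading with the spread of a 16-replicate average. All constants and function names are hypothetical and chosen only for the demonstration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_SIGNAL = 1000.0  # hypothetical true spot intensity
BACKGROUND = 200.0    # hypothetical systematic offset on every reading
NOISE_SD = 100.0      # hypothetical random noise per reading


def measure():
    """One simulated raw reading: signal + background + random noise."""
    return TRUE_SIGNAL + BACKGROUND + random.gauss(0.0, NOISE_SD)


def corrected_mean(n_replicates):
    """Subtract the background, then average over the replicates."""
    readings = [measure() - BACKGROUND for _ in range(n_replicates)]
    return statistics.mean(readings)


# Spread of the estimate across many simulated experiments.
sd_single = statistics.stdev([corrected_mean(1) for _ in range(2000)])
sd_avg16 = statistics.stdev([corrected_mean(16) for _ in range(2000)])

print(f"sd with 1 replicate:   {sd_single:.1f}")
print(f"sd with 16 replicates: {sd_avg16:.1f}")
```

For independent noise the spread of the average shrinks roughly as one over the square root of the number of replicates, so 16 replicates should give about a quarter of the single-reading spread.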

6
Expression data is usually not Gaussian (normal)
  • Many statistical tests assume that the data is
    normally distributed.
  • Expression microarray spot intensity data (for
    example) is not.
  • Intensity ratio data (two-channel) is not normal
    either.
  • Both range from 0 to infinity, whereas normal
    data is unbounded and symmetric about its mean.

7
Taking the logarithm helps normalize expression
ratio data
  • The expression ratio is plotted versus the
    expression level (the geometric mean of the two
    channels).
  • Plotting the log ratio vs. the log expression
    level gives data that is centered around y = 0
    and fairly normal-looking.
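
A minimal sketch of the transformation, assuming two-channel intensities called red and green (hypothetical names). The labels M (log ratio) and A (log of the geometric mean) are a common convention and are not taken from the slides.

```python
import math


def log_ratio_and_level(red, green):
    """Return (M, A): the log2 ratio and the log2 geometric-mean intensity."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)  # log2 of sqrt(red * green)
    return m, a


# A two-fold increase and a two-fold decrease at the same overall level.
up = log_ratio_and_level(2000.0, 1000.0)
down = log_ratio_and_level(1000.0, 2000.0)

print("up:  ", up)
print("down:", down)
```

On the raw scale the two ratios are 2 and 0.5, which sit at unequal distances from 1. On the log2 scale they become +1 and -1, symmetric around 0.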

8
Taking the log of the expression ratio fixes
the left tail

9
LOWESS Normalization
  • Sometimes there is still a bias that depends on
    the expression level.
  • This can be removed by a type of regression
    called Locally Weighted Scatterplot Smoothing.
  • This computes and subtracts the mean locally for
    various values of expression level (RG).
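
The local-mean idea can be sketched as follows. This is a simplified stand-in, not full LOWESS: it computes a tricube-weighted local mean of the log ratio over the nearest neighbours in expression level, whereas LOWESS proper fits a local line and usually adds robustness iterations. The function names, the frac parameter, and the synthetic data are all assumptions made for the example.

```python
import random


def local_weighted_mean(a_values, m_values, frac=0.3):
    """Tricube-weighted mean of M over each point's nearest neighbours in A."""
    n = len(a_values)
    k = max(2, int(frac * n))  # number of neighbours per local estimate
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        bandwidth = dists[k - 1] or 1e-12  # distance to the k-th nearest point
        num = den = 0.0
        for a, m in zip(a_values, m_values):
            u = abs(a - a0) / bandwidth
            if u < 1.0:
                w = (1.0 - u ** 3) ** 3  # tricube weight
                num += w * m
                den += w
        fitted.append(num / den)  # den >= 1 because the point itself has w = 1
    return fitted


def normalize(a_values, m_values, frac=0.3):
    """Subtract the local mean trend from every log ratio."""
    trend = local_weighted_mean(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]


# Synthetic data: the log ratio drifts with the log expression level.
random.seed(1)
a_vals = [6.0 + 8.0 * i / 199 for i in range(200)]
m_vals = [0.2 * (a - 10.0) + random.gauss(0.0, 0.05) for a in a_vals]

m_norm = normalize(a_vals, m_vals)
mean_abs_before = sum(abs(m) for m in m_vals) / len(m_vals)
mean_abs_after = sum(abs(m) for m in m_norm) / len(m_norm)
print(f"mean |M| before: {mean_abs_before:.3f}  after: {mean_abs_after:.3f}")
```

A local mean is biased near the ends of the intensity range, which is one reason practical implementations prefer a local linear fit.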

10
II. Statistical Analysis
  • Determining what differences in expression are
    statistically significant
  • Controlling false positives

11
When are two measurements significantly different?
  • We want to say that an expression ratio is
    significant if it is big enough (> 1) or small
    enough (< 1).
  • A two-fold ratio (for example) is only
    significant if the variances of the underlying
    measurements are sufficiently small.
  • The significance is related to the area of the
    overlap of the underlying distributions.

12
The Z-test
  • If the data is approximately normal, convert it
    to a Z-score.
  • X can be the log expression ratio; μ is then 0.
  • σ is the sample standard deviation; n is the
    number of repeats.
  • The Z-score is distributed N(0,1) (standard
    normal).
  • The significance level is the area in the tail(s)
    of the standard normal distribution.
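
The formula itself did not survive in the transcript. Assuming the standard one-sample form Z = (mean(X) - μ) / (σ / sqrt(n)), the calculation can be sketched as below. The function name and the log-ratio values are hypothetical.

```python
import math
import statistics


def z_test(values, mu=0.0):
    """Return the Z-score of the sample mean and its two-sided p-value."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    z = (mean - mu) / (sd / math.sqrt(n))
    # Two-tailed area under the standard normal: 2 * (1 - Phi(|z|)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# Hypothetical log2 expression ratios from five repeats of one gene.
z, p = z_test([1.1, 0.9, 1.2, 0.8, 1.0])
print(f"changed gene:   z = {z:.2f}, p = {p:.2e}")

# Ratios scattered symmetrically around 0 give no evidence of change.
z0, p0 = z_test([-0.1, 0.1, -0.2, 0.2])
print(f"unchanged gene: z = {z0:.2f}, p = {p0:.2f}")
```

With only a handful of repeats the sample standard deviation is itself uncertain, which is the situation the t-test is designed for.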

13
The t-test
  • The t-test makes fewer assumptions about the data
    than the Z-test
  • It can be applied to compare two average
    measurements which can have
  • Different variances
  • Different numbers of observations
  • You compute the t-statistic (see pages 654-655)
    and then look up the significance level of
    Student's t distribution in a table.
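
A sketch of one common unequal-variance form, Welch's t-statistic with the Welch-Satterthwaite degrees of freedom. Whether this matches the exact variant on the cited pages is an assumption. The sample values are hypothetical, and the p-value lookup is left to a table, as the slide describes.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for two samples with unequal variances and sizes."""
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a) / na  # squared standard error of mean a
    vb = statistics.variance(sample_b) / nb  # squared standard error of mean b
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = diff / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical log expression values under two conditions.
treated = [2.1, 2.5, 1.9, 2.3]
control = [1.0, 1.2, 0.8, 1.1, 0.9]

t_stat, df = welch_t(treated, control)
print(f"t = {t_stat:.2f}, df = {df:.2f}")
```

The degrees of freedom are generally not an integer here, so a table lookup would use the nearest lower whole value.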

14
III. Cluster Analysis
  • Similar expression patterns
  • Groups of genes/proteins with similar expression
    profiles
  • Similar expression sub-patterns
  • Groups of genes/proteins with similar expression
    profiles in a subset of conditions
  • Different clustering methods
  • Assessing the value of clusters

15
Example Gene Expression Profiles
  • Expression level of a gene is measured at
    different time points after treating cells.
  • Many different expression profiles are possible.
  • No effect
  • Immediate increase or decrease
  • Delayed increase or decrease
  • Transient increase or decrease

16
Clustering by Eye
  • n genes or proteins
  • m different samples (or conditions)
  • Represent a gene as a point
  • X = <x1, x2, ..., xm>
  • If m is 1 or 2 (or even 3) you can plot the
    points and look for clusters of genes with
    similar expression.
  • But what if m is bigger than 3?
  • Need to reduce the dimensionality: PCA

17
Reducing the Dimensionality of Data: Principal
Components Analysis
  • PCA linearly maps each point to a small set of
    dimensions (components).
  • The principal components are dimensions that
    capture the maximum variation in the data.
  • The principal components capture most of the
    important information in the data (usually).
  • Plotting each point's values in two of the
    principal component dimensions allows us to see
    clusters.

2-D Gel Data
18
PCA: An Illustration (Yeast Cell Cycle Gene
Expression)
  • Singular value decomposition (SVD) of a matrix X
    is
  • X = U Σ V^T
  • The mapped value of X is
  • Y = X V (which equals U Σ)
  • The rows of Y give the mapping of each gene.
  • Mapped gene i: Yi = <y1, y2, ..., ym>
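
A minimal sketch of PCA through the SVD, assuming NumPy is available and using a random matrix as a stand-in for real expression data. Each column is centred before the decomposition, a usual PCA step that the slide does not state. The genes are then projected onto the right singular vectors, X V, which equals U Σ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 genes (rows) measured in 6 samples (columns).
n_genes, m_samples = 50, 6
X = rng.normal(size=(n_genes, m_samples))
X = X - X.mean(axis=0)  # centre each sample (column) at zero

# Thin SVD: X = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

Y = X @ Vt.T   # coordinates of every gene on the principal axes
Y2 = Y[:, :2]  # first two components, ready for a 2-D scatter plot

# Fraction of the total variance captured by each component.
explained = s ** 2 / np.sum(s ** 2)
print("explained variance fractions:", np.round(explained, 3))
```

Real expression data usually has a few dominant components, unlike this random matrix, so the first two columns of Y would carry most of the structure.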

Figure credit: PNAS (2000)
19
Clustering Using Statistics
  • Algorithm identifies groups.
  • Example: similar expression profiles
  • Distance measure between pairs of points is
    needed.

20
Distance Measures Between Pairs of Points
  • In order to cluster the points (genes or
    conditions), we need some concept of which points
    are close to each other.
  • So we need a measure of distance (or,
    conversely, similarity) between two rows (or
    columns) in our n by m matrix.
  • We can then compute all the pair-wise distances
    between rows (or columns).

21
Standard Distance Measures
  • Euclidean Distance
  • Pearson Correlation Coefficient
  • Mahalanobis Distance

22
Euclidean Distance
  • Standard, everyday distance
  • Treats all dimensions equally
  • If some genes vary more than others (have higher
    variance), they influence the distance more.
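
A small sketch of a pairwise Euclidean distance matrix, using three hypothetical profiles in which the third dimension varies far more than the first two.

```python
import math


def euclidean(p, q):
    """Ordinary straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise(rows, dist):
    """Full symmetric matrix of distances between every pair of rows."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Hypothetical profiles: the third dimension has a much larger spread.
profiles = [
    [1.0, 2.0, 100.0],
    [1.1, 2.1, 300.0],  # almost identical to row 0 except in dimension 3
    [5.0, 9.0, 110.0],  # differs from row 0 in dimensions 1 and 2
]

D = pairwise(profiles, euclidean)
for row in D:
    print([round(d, 1) for d in row])
```

Row 0 ends up much closer to row 2 than to row 1, because the high-variance third dimension dominates the sum of squares.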

23
Mahalanobis Distance
  • The normalized Euclidean distance
  • Scales each dimension by the variance in that
    dimension.
  • This is useful if the genes tend to vary much
    more in one sample than in others since it
    reduces the effect of that sample on the
    distances.
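
The sketch below implements only the per-dimension scaling described on this slide, which corresponds to a Mahalanobis distance with a diagonal covariance matrix. The general form uses the inverse of the full covariance matrix and is not shown. Profiles and function names are hypothetical.

```python
import math
import statistics


def column_variances(rows):
    """Sample variance of each dimension (column) across all rows."""
    return [statistics.variance(col) for col in zip(*rows)]


def scaled_euclidean(p, q, variances):
    """Euclidean distance with each squared difference divided by its variance.

    Assumes every variance is non-zero.
    """
    return math.sqrt(
        sum((a - b) ** 2 / v for a, b, v in zip(p, q, variances))
    )


# Hypothetical profiles: the third dimension has a much larger spread.
profiles = [
    [1.0, 2.0, 100.0],
    [1.1, 2.1, 300.0],
    [5.0, 9.0, 110.0],
]

variances = column_variances(profiles)
d01 = scaled_euclidean(profiles[0], profiles[1], variances)
d02 = scaled_euclidean(profiles[0], profiles[2], variances)
print(f"d(0,1) = {d01:.2f}, d(0,2) = {d02:.2f}")
```

With plain Euclidean distance row 0 is closer to row 2 on these values. After scaling, row 0 is closer to row 1, because the noisy third dimension no longer dominates.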

24
Pearson Correlation Coefficient
  • Distances are small when two genes have similar
    patterns of change even if the sizes of the
    changes differ.
  • This is accomplished by centering each gene's
    expression levels and scaling by their sample
    standard deviation across conditions.
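
A sketch of a correlation-based distance. Using 1 - r as the distance is one common convention and is an assumption here, since the slide does not give the formula. The three gene profiles are hypothetical.

```python
import math


def pearson(p, q):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)  # assumes neither profile is constant


def pearson_distance(p, q):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(p, q)


gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same pattern, ten times larger
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # mirror-image pattern

d_ab = pearson_distance(gene_a, gene_b)
d_ac = pearson_distance(gene_a, gene_c)
print(f"d(a,b) = {d_ab:.3f}, d(a,c) = {d_ac:.3f}")
```

Genes a and b are far apart in Euclidean terms but have essentially zero correlation distance, which is the behaviour the first bullet describes.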

25
Choice of Distance Matters
  • Hierarchical clustering (dendrogram) of tissues.
  • Corresponds to clustering the columns of the
    matrix.
  • Branches are different (cancer B/C vs A/B).

26
Clustering Algorithms
  • Hierarchical Clustering
  • K-means clustering
  • Self-organizing maps and trees

27
Hierarchical Clustering
  • Algorithms progressively merge clusters or split
    clusters.
  • Merging criterion can be single-linkage or
    complete-linkage.
  • Produce dendrograms
  • Can be interpreted at different thresholds.

28
Types of Linkage
  • A. Single Linkage
  • B. Complete Linkage
  • C. Centroid Method
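
A naive sketch of the merging (agglomerative) variant with single and complete linkage. The centroid method is not implemented. Function names and the four one-dimensional points are hypothetical, and the brute-force search is meant for illustration rather than for large data sets.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(c1, c2, points, linkage):
    """Single linkage uses the closest pair, complete linkage the farthest."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    return min(dists) if linkage == "single" else max(dists)


def agglomerate(points, linkage="single"):
    """Merge the two closest clusters until one is left; return the history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y], points, linkage)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return merges


points = [[0.0], [1.0], [3.0], [7.0]]
single_merges = agglomerate(points, "single")
complete_merges = agglomerate(points, "complete")

for left, right, height in single_merges:
    print("single  ", left, "+", right, "at", height)
for left, right, height in complete_merges:
    print("complete", left, "+", right, "at", height)
```

The merge heights are the levels at which branches join in the dendrogram. Cutting at a chosen height gives the clusters that exist at that threshold, and the two linkages can disagree on those heights even when they merge in the same order.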

29
K-means Clustering
  • Related to Expectation Maximization
  • You specify the number of clusters
  • Iteratively moves the means of the clusters to
    maximize the likelihood (minimize total error).
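
A minimal sketch of the usual iteration (often called Lloyd's algorithm), which alternates an assignment step and a mean-update step in a way that resembles the two steps of Expectation Maximization. The starting centres are passed in explicitly to keep the example deterministic; in practice they are usually chosen at random. All names and data are hypothetical.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance (enough for finding the nearest centre)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, n_iter=20):
    """Alternate between assigning points and recomputing cluster means."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centre
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


points = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
centers, labels = kmeans(points, centers=[points[0], points[3]])

total_error = sum(
    math.sqrt(sq_dist(p, centers[lab])) for p, lab in zip(points, labels)
)
print("labels: ", labels)
print("centers:", centers)
```

The result can depend on the starting centres, so it is common to run the procedure several times and keep the solution with the lowest total error.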