Gene Expression Data Analysis - PowerPoint PPT Presentation

Slides: 65

Provided by: bioinforma7

Transcript and Presenter's Notes

Title: Gene Expression Data Analysis

1
Gene Expression Data Analysis

Zhang Louxin
Dept. of Mathematics
Nat. University of Singapore

2
(No Transcript)
3
RNA Transcription (by L. Miller)
[Figure: RNA polymerase transcribing mRNA, with the 3' and 5' ends labelled]
4
The Transcriptome (by L. Miller)
5
cDNA Microarray
Based on the hybridization principle.
Uses parallelism so that one can observe the activity of thousands of genes at a time.
P. Brown/Stanford
6
Paradigm for Using cDNA Microarrays
Animals, Patients, Cell Lines
Appropriate Tissue
Extract RNA
Microarray Hybridization
Scan Microarray
Computer Analysis
The data measure, for each gene, the ratio of its mRNA abundance in the test sample to that in the reference sample.
7
cDNA microarray schema -- P. Brown's approach
Data from a single experiment measure the relative ratio of mRNA abundance of each gene on the array in the two samples (D. Duggan et al., Nature Genetics, 1999).
8
Affymetrix GeneChip Probe Arrays
Single stranded, fluorescently labeled DNA target
9
cDNA microarray
1. DNA microarrays are ordered assemblies of DNA
sequences immobilized on a solid support
(such as chemically modified glass).
10
What is a DNA microarray?
2. The DNA sequences (e.g. PCR products or long
oligos) correspond to the transcribed regions
of genes.
[Figure: genomic DNA carrying genes X, Y and Z; gene Y has exons 1, 2 and 3; example probe sequence ATTTCAGGCGCATGCTCGG]
11
What is a DNA microarray?
3. The DNA sequences (also known as probes) are capable of annealing with cDNA targets derived from the mRNA of a cell.
12
Scanning/Signal Detection
Cy3 channel
Cy5 channel
13
GIS COMP 19K human oligo array v1.0p2
14
Applications

Gene function assignment (guilt by association)
-- Cluster genes together into groups.
-- Unknown genes are assigned a function based on the known functions of genes in the same expression cluster.
Gene prediction
The regulatory network of living cells
-- For a given cell, arrays can produce a snapshot revealing which genes are on or off at a particular time.
Clinical diagnosis (especially for cancers)
-- Cancers are caused by gene disorders. These disorders result in a deviation of the gene expression profile from that of the normal cell.

15
Microarray Data Analysis
Array Quantification (from digital image)
Quality control
Data Mining
16
Gene Expression Matrix
17
Difficulties of the Analysis

The myriad random and systematic measurement errors.
Small numbers of samples (cell lines, patients), but a large number of variables (probes or genes).

Random errors are caused by factors such as the time at which the arrays are processed, target accessibility, and variation in washing procedures.
Systematic errors are biases. They result in a constant tendency to over- or underestimate true values. Biasing factors depend on the spotting, scanning, and labelling technologies.

18
Normalization 1 - ratio and log transformation
Ratios of raw expression values from image quantification are usually not appropriate for statistical analysis. Log-transformed data are usually used. Why?
(1) The log transformation removes much of the proportional relationship between random error and signal intensity. Most statistical tests assume an additive error model.
(2) Distributions of replicated logged expression values tend to be normal.
(3) Summary statistics of log ratios yield the same quantities, regardless of the numerator/denominator assignment.
Example: Consider the treatment:control ratios for three replicates, 2:1.1, 5:1.4 and 15:2, and the inverted ratios. They have different means and standard deviations, but their logs have means of equal magnitude (with opposite signs) and the same standard deviations.
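Point (3) can be checked numerically. The sketch below is only an illustration: the three ratios are hypothetical values chosen for easy arithmetic, not data from the presentation, and the choice of base-2 logarithms is an assumption.

```python
import math
import statistics

# Hypothetical treatment:control ratios for three replicates (illustrative only).
ratios = [2.0, 4.0, 8.0]
# The same replicates with numerator and denominator swapped.
inverted = [1.0 / r for r in ratios]


def summarize(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


# Raw ratios: the two assignments give unrelated summary statistics.
raw_mean, raw_sd = summarize(ratios)
inv_mean, inv_sd = summarize(inverted)

# Log ratios: the means differ only in sign and the deviations coincide.
log_mean, log_sd = summarize([math.log2(r) for r in ratios])
log_inv_mean, log_inv_sd = summarize([math.log2(r) for r in inverted])

print("raw:", (raw_mean, raw_sd), "inverted:", (inv_mean, inv_sd))
print("log:", (log_mean, log_sd), "inverted log:", (log_inv_mean, log_inv_sd))
```

Swapping which sample is the numerator only flips the sign of every log ratio, which is why the spread is unchanged.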
19
Normalization 2 - normalize two experiments

The expression levels of genes are normalized to a common standard so that they can be compared.
The power of microarray analysis comes from the analysis of many experiments to identify common patterns of expression.
Techniques
-- Housekeeping genes
-- Spiked controls
-- Global normalization to the overall distribution

[Figure: expression values plotted per experiment (exp1, exp2)]
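One simple form of global normalization can be sketched as follows. This is an assumption about what "global normalization" might look like in practice (median centring of log values); the slides do not say which variant was used, and the expression values are hypothetical.

```python
import statistics


def median_center(log_values):
    """Shift one experiment's log expression values so that their median is 0."""
    med = statistics.median(log_values)
    return [v - med for v in log_values]


# Hypothetical log2 expression values for the same four genes in two experiments.
exp1 = [1.0, 2.0, 3.0, 4.0]
exp2 = [3.0, 4.0, 5.0, 6.0]  # same pattern, but globally brighter by 2 units

norm1 = median_center(exp1)
norm2 = median_center(exp2)
print(norm1, norm2)
```

After centring, the two experiments share a common scale, so a per-gene comparison no longer reflects the overall brightness difference between arrays.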
20
Normalization 3 - Outliers
Concept: Outliers are extreme values in a distribution of replicates. Their proportion can be as high as 15% in a typical microarray experiment.
Reason:
(1) They are caused by image artifacts (e.g. dust on a cDNA array, or blooming of adjoining spots on a radioisotopic array).
(2) They can also be caused by factors such as cross-hybridization or the failure of one probe to hybridize adequately.
Detection: Large sample sizes are needed to detect outliers more accurately and precisely. Estimate errors across all the probes, rather than on a probe-by-probe basis.
21
Mining Gene Expression Data
Classification: Classifying genes (or tissues, conditions) into groups, each containing genes (or tissues) with similar attributes.
Class Prediction: Given a set of known classes of genes (or tissues), determine the correct class for new genes (or tissues).
22
Part 1: Molecular Classification
Traditional Clustering Algorithms
-- K-means, Self-Organising Maps, Hierarchical Clustering
Graph Theoretic-based Clustering Algorithms (Ben-Dor et al. 99, Hartuv et al. 99)
23
K-means, Self-Organising Maps
Input: Gene expression matrix, and an integer k.
Output: k disjoint groups of genes with similar expression.
[Figure: clustering genes with K = 3; expression plotted across experiments exp1 to exp4]
24
Similarity and Dissimilarity Measures
Two main classes of distance functions are used here:
-- Correlation coefficients, for comparing the shape of expression curves: Pearson Correlation Coefficient.
-- Metric distances, for measuring the distance between two points in a metric space: Manhattan distance, Euclidean distance.
25
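Pearson Correlation Coefficient ρ(X, Y) (between -1 and 1)
Let $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$. Then
$$\rho(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_X \sigma_Y},$$
where $\bar{x}$ and $\bar{y}$ are the means and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.
[Figure: plots of X against Y illustrating negative correlation and positive correlation]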
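A minimal sketch of the Pearson correlation coefficient, using population standard deviations (division by n). The three gene profiles are hypothetical and serve only to show the two extremes.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (population form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)


# Hypothetical expression profiles over four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # mirror image of gene_a

print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

The coefficient compares shape only: gene_a and gene_b are ten-fold apart in magnitude yet perfectly correlated. A profile with zero variance would cause a division by zero in this sketch.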
26
Pearson Correlation Coefficient ρ(X, Y) (between -1 and 1)
Pitfalls
[Figure: plot of X against Y with a large correlation]
27
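Distance metrics

Let $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$.

Euclidean distance
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
-- Most commonly used distance
-- Identical to the geometric distance in the multidimensional space

Manhattan distance
$$d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$$
-- Sum of absolute differences across dimensions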
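The two metric distances can be sketched in a few lines. The function names are illustrative, and the points are hypothetical.

```python
import math


def euclidean(x, y):
    """Straight-line (geometric) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(x, y))


origin = [0.0, 0.0]
point = [3.0, 4.0]
print(euclidean(origin, point), manhattan(origin, point))
```

Unlike a correlation coefficient, both distances are sensitive to the magnitude of expression, not only to the shape of the profile.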
28
K-means Algorithm
Arbitrarily partition the input points into K clusters.
Each cluster is represented by its geometrical center.
Repeatedly adjust the K clusters by assigning each point to the nearest cluster.
[Figure: input points and the initial partition for K = 3]
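The three steps above can be sketched as follows. This is a minimal illustration rather than a production implementation: the initial partition (point i goes to cluster i mod k), the handling of empty clusters, and the iteration cap are all assumptions, and the points are hypothetical.

```python
import math


def centroid(points):
    """Geometrical center of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def k_means(points, k, max_iter=100):
    """Return a cluster label (0..k-1) for every point."""
    # Arbitrary initial partition: point i starts in cluster i mod k.
    labels = [i % k for i in range(len(points))]
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster falls back to an arbitrary input point.
            centers.append(centroid(members) if members else list(points[c]))
        # Reassign every point to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # no point moved, so the partition is stable
            break
        labels = new_labels
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = k_means(points, k=2)
print(labels)
```

The result depends on the initial partition, which is one reason K-means is usually run several times in practice.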
29
Hierarchical Clustering Algorithm
Input: Some data points.
Output: A set of clusters arranged in a tree, that is, a hierarchical structure.
What is the distance between clusters? Average pairwise distance.
Each internal node corresponds to a cluster.
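A minimal sketch of bottom-up (agglomerative) clustering with average pairwise distance between clusters. The tree is returned as nested tuples whose leaves are point indices; this representation, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
import math
from itertools import combinations


def average_distance(c1, c2, points):
    """Average pairwise Euclidean distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical(points):
    """Merge the two closest clusters until one remains; return the merge tree."""
    # Each cluster is (member indices, subtree); a leaf subtree is a point index.
    clusters = [([i], i) for i in range(len(points))]
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_distance(
                clusters[ab[0]][0], clusters[ab[1]][0], points
            ),
        )
        merged = (clusters[a][0] + clusters[b][0], (clusters[a][1], clusters[b][1]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters[0][1]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
tree = hierarchical(points)
print(tree)
```

Every tuple in the returned tree is an internal node, and the leaves beneath it are the members of the corresponding cluster.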
30
Identify Subtypes of Diffuse Large B-Cell Lymphoma (DLBCL)
(Alizadeh et al., Nature, 2000)

A special cDNA microarray, the Lymphochip, was designed.
Study gene expression patterns in three lymphoid malignancies: DLBCL, FL and CLL.

-- 12,069 cDNA clones from a germinal centre B-cell library
-- 2,338 cDNA clones from libraries derived from DLBCL, follicular lymphoma (FL), mantle cell lymphoma, and chronic lymphocytic leukaemia (CLL)
-- 3,349 other cDNA clones
96 normal and malignant lymphocyte samples
31
Germinal centre B-like DLBCL vs. Activated B-like DLBCL
Courtesy Alizadeh
32
Germinal centre B-like DLBCL vs. Activated B-like DLBCL
International Prognostic Index
Courtesy Alizadeh
33
Remarks

Programmes designed to cluster data generally re-order the rows, or columns, or both, such that patterns of expression become visually apparent when presented in this fashion.
There might never be a best approach for clustering data. Different approaches allow different aspects of the data to be explored.
They are subjective. Different distance metrics will place different objects in different clusters.
Understanding the underlying biology, particularly of gene regulation, is important.

34
Research Problem
Bi-clustering: cluster genes and experiments at the same time.
Why? Some genes are only co-regulated in a subset of conditions (experiments).
References
Y. Kluger et al. Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions. Genome Res. 13, 703-716.
L. Zhang and S. Zhu. A New Clustering Method for Microarray Data Analysis. Proc. IEEE CSB 2002.
35
Molecular Class Prediction

Several supervised learning methods are available:
Neural Networks
Support Vector Machines
Decision trees
Other statistical methods

36
A Supervised Learning Method for Predicting a Binary Class
[Figure: positive and negative examples feed a learning step; the prediction step then answers Yes or No for a new item]
A class is just a concept! In the learning step, the class is modelled as a mathematical object (a function of multiple variables, or a subspace in a high-dimensional space) representing knowledge of the class.
37
Learning the class of tall men
The class is modelled as the half space h > 63.
[Figure: examples]
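A one-dimensional half space can be learned from labelled examples by placing the threshold midway between the tallest negative example and the shortest positive one. The sketch below assumes the examples are separable; the heights and the midpoint rule are hypothetical choices for illustration, not taken from the presentation.

```python
def learn_threshold(examples):
    """examples: list of (height, is_tall). Return t such that 'tall' means height > t."""
    tall = [h for h, label in examples if label]
    not_tall = [h for h, label in examples if not label]
    # Assumption: the classes are separable, so max(not_tall) < min(tall).
    return (max(not_tall) + min(tall)) / 2


# Hypothetical labelled examples.
examples = [(58, False), (61, False), (65, True), (70, True)]
threshold = learn_threshold(examples)


def is_tall(height):
    """Predict membership of the learned half space."""
    return height > threshold


print(threshold, is_tall(64), is_tall(60))
```

The learned object is just the number `threshold`; prediction for a new item is a comparison against it, which is the one-dimensional analogue of testing on which side of a hyperplane a point lies.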
38
Support Vector Machines
A support vector machine finds a hyperplane that maximally separates the data points into two classes in the feature space.
39
Molecular Class Prediction -- Leukemia Case
Morphology does not distinguish leukemias very well. Golub et al. (Science, 1999) proposed a voting method for predicting acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) using gene expression fingerprinting.
In that work, an Affymetrix DNA chip with 6817 genes was used for 72 ALL/AML samples.
40
The voting algorithm (Golub99)
Courtesy Golub
1. Select a subset of (2 × 25) genes highly correlated with the ALL/AML distinction, based on 38 training samples.
Correlation metric:
$$P(g) = \frac{\mu_{AML}(g) - \mu_{ALL}(g)}{\sigma_{AML}(g) + \sigma_{ALL}(g)},$$
where $\mu_{AML}(g)$ ($\mu_{ALL}(g)$) is the mean expression level of g in AML (ALL) samples, and $\sigma_{AML}(g)$ ($\sigma_{ALL}(g)$) is the within-class standard deviation of expression of g in AML (ALL) samples.
2. Each selected gene casts a weighted vote for a new sample; the total of the weighted votes decides the winning class.
41
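The voting method: Separating samples by hyperplanes
Mathematically, the total of all the votes on a new sample X is
$$V = \sum_{g} P(g) \left( x_g - \frac{\mu_{AML}(g) + \mu_{ALL}(g)}{2} \right),$$
where $P(g)$ is the correlation metric of gene g, $\mu_{AML}(g)$ and $\mu_{ALL}(g)$ are its mean expression levels in the AML and ALL training samples, and $x_g$ is the expression level of g in the new sample X.
If V > 0, X is classified as AML; otherwise, X is ALL.
[Figure: a hyperplane separating AML samples from ALL samples]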
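The weighted voting scheme can be sketched as below, following the form published by Golub et al. (weight = difference of class means over the sum of class standard deviations; decision boundary = midpoint of the class means). The gene names and expression values are hypothetical, the gene selection step is skipped, and the use of sample standard deviations is an assumption.

```python
import statistics


def gene_stats(values_aml, values_all):
    """Return (vote weight, decision boundary) for one gene from its training values."""
    mu_aml, mu_all = statistics.mean(values_aml), statistics.mean(values_all)
    sd_aml, sd_all = statistics.stdev(values_aml), statistics.stdev(values_all)
    weight = (mu_aml - mu_all) / (sd_aml + sd_all)
    boundary = (mu_aml + mu_all) / 2
    return weight, boundary


def total_vote(sample, training):
    """sample: {gene: expression}; training: {gene: (AML values, ALL values)}."""
    vote = 0.0
    for gene, (values_aml, values_all) in training.items():
        weight, boundary = gene_stats(values_aml, values_all)
        vote += weight * (sample[gene] - boundary)
    return vote


def classify(sample, training):
    """A positive total vote means AML; otherwise ALL."""
    return "AML" if total_vote(sample, training) > 0 else "ALL"


# Hypothetical training data for two already-selected genes.
training = {
    "geneA": ([9.0, 10.0, 11.0], [1.0, 2.0, 3.0]),  # higher in AML
    "geneB": ([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]),    # higher in ALL
}

print(classify({"geneA": 10.0, "geneB": 2.0}, training))
print(classify({"geneA": 2.0, "geneB": 8.0}, training))
```

Because the total vote is a linear function of the sample's expression values, the rule "vote > 0" is a hyperplane in expression space, which is the point made in the slide title.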
42
Decision Tree Learning

Information-reduction learning method.
Representing a class or concept as a logic sentence:

IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis

When to use decision trees:
-- Instances describable by attribute-value pairs
-- Target function is discrete valued
-- Possibly noisy training data

Examples: medical diagnosis, credit risk analysis
43
Textbook Example: PlayTennis

Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf assigns a classification.

44
Remarks

A decision tree is constructed by top-down induction.
Preference for short trees, and for those with high information gain attributes near the root.
Information is measured with entropy.
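Entropy and information gain, the quantities used to choose which attribute to test near the root, can be sketched as follows. The tiny table is hypothetical (it borrows attribute names in the style of the PlayTennis example) and is not the textbook data set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on the given attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical training rows.
rows = [
    {"Outlook": "Sunny", "Windy": "No", "Play": "Yes"},
    {"Outlook": "Sunny", "Windy": "Yes", "Play": "Yes"},
    {"Outlook": "Rain", "Windy": "No", "Play": "No"},
    {"Outlook": "Rain", "Windy": "Yes", "Play": "No"},
]

print(information_gain(rows, "Outlook", "Play"))
print(information_gain(rows, "Windy", "Play"))
```

In this toy table Outlook determines the class completely while Windy carries no information, so a top-down learner would test Outlook at the root.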

45
ALL vs AML - Decision Tree this time
(Y. Sun, tech report, MIT)

Single gene (zyxin), single branch tree
-- 38/38 correct on training cases
-- 31/34 correct on test cases, 3 errors
-- X5735_at lt(81)38 ALL
Tree size up to 3 genes
-- 1 decision tree with 1 error
-- 7 decision trees with 2 errors
-- 7 decision trees with 3 errors
46
Gene Selection
Gene selection is critical in molecular class prediction, as we learn from the decision tree results. Why?

In a cellular process, only a relatively small set of genes is active.
Mathematically, each gene is just a feature. The more weak features there are, the noisier the data.
More features give rise to the overfitting problem.

Research Problem: How to select genes?
47
Two Approaches
1. Gene selection is done first, and the selected genes are then used for learning, as in the paper by Golub et al.
2. Gene selection and learning are done together, as in decision tree learning.
Does this make a difference in learning?
48
Discovery
49
(No Transcript)
50
BioCluster
Similarity Measure
Cluster Number
Clustering Methods
Self-Organising Map(SOM)
Hierarchical
K-Means
Microarray Data Sets
51
Concluding remarks

Some previous work and our own work on analysing gene expression data are summarised.
Our group will focus on designing more efficient and sophisticated algorithms and software tools for mining and visualizing gene expression data.

52
Advantages of using arrays
A microarray contains up to 8000 genes or probes, and hence it is not necessary to guess in advance what the important genes or mechanisms are.
An array produces a broader, more complete, less biased, genome-wide expression profile.
53
Problems with Traditional Clustering Algorithms
1. They are not quite suitable for studying genes with multiple functions or genes regulated under multiple factors.
2. They cannot handle data errors or missing data well. Errors and missing values often occur when tissues are rare. Normalisation of expression levels across different experiments is also problematic.
54
Our Approach (ZZ00)
Let A be a gene expression matrix with gene set X and experiment set Y, and let I ⊆ X and J ⊆ Y. Then I and J specify a submatrix A(I, J). We associate a score S_IJ(i, j) with each entry of A(I, J) (score formula not transcribed).
A(I, J) is δ-smooth if S_IJ(i, j) ≤ δ for all i ∈ I, j ∈ J.
55
Clustering Problems

Smooth Cluster Problem
Instance: A gene expression matrix A with gene set X and experiment set Y, a subset J ⊆ Y, a number δ ≥ 0.
Question: Find a largest δ-smooth submatrix A(I, J).

To handle genes with multiple functions, we use a known idea (Hartigan72, Cheng and Church00).

Smooth Bicluster Problem
Instance: A gene expression matrix A with gene set X and experiment set Y, a number δ ≥ 0.
Question: Find I ⊆ X and J ⊆ Y with the largest min(|I|, |J|) such that A(I, J) is δ-smooth.

56
Greedy Algorithms for Smooth Cluster Problem
Top-Down Algorithm
Input: A gene expression matrix A with gene set X and experiment set Y, a subset J ⊆ Y, a number δ > 0.
Output: A set I ⊆ X such that A(I, J) is δ-smooth under J.
Set I = X initially.
Repeat
   If A(I, J) is δ-smooth, stop.
   Select a row i ∈ I that is furthest from the cluster's center and remove it, that is, I = I − {i}.
End repeat.
57
Top-Down and Bottom-Up Algorithm
Input: A gene expression matrix A with gene set X and experiment set Y, a subset J ⊆ Y, a number δ > 0.
Output: A set I ⊆ X such that A(I, J) is δ-smooth under J.
Set I = X initially.
Apply the Top-Down Algorithm first.
Repeat
   Select a row r ∈ X − I that is closest to the center of cluster I; if A(I ∪ {r}, J) is δ-smooth, I = I ∪ {r}.
End repeat.
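The two greedy phases can be sketched as below. Important assumptions: the smoothness score is not given in this transcript, so the code uses a stand-in score (the absolute deviation of an entry from its column mean within the cluster); "center" is taken to be the vector of column means; distance to the center is Euclidean; and the bottom-up phase stops at the first candidate row that would break smoothness. The parameter `delta` plays the role of the smoothness threshold, and the matrix is hypothetical.

```python
import math


def center(matrix, rows, cols):
    """Column means of the submatrix given by rows and cols (assumed 'center')."""
    return [sum(matrix[i][j] for i in rows) / len(rows) for j in cols]


def is_smooth(matrix, rows, cols, delta):
    """Stand-in smoothness test: every entry lies within delta of its column mean."""
    c = center(matrix, rows, cols)
    return all(
        abs(matrix[i][j] - c[k]) <= delta for i in rows for k, j in enumerate(cols)
    )


def distance_to_center(matrix, i, rows, cols):
    """Euclidean distance from row i (restricted to cols) to the center of rows."""
    return math.dist([matrix[i][j] for j in cols], center(matrix, rows, cols))


def top_down(matrix, rows, cols, delta):
    """Remove the row furthest from the center until the submatrix is smooth."""
    rows = list(rows)
    while not is_smooth(matrix, rows, cols, delta):
        worst = max(rows, key=lambda i: distance_to_center(matrix, i, rows, cols))
        rows.remove(worst)
    return rows


def bottom_up(matrix, selected, all_rows, cols, delta):
    """Add the closest outside rows while the submatrix stays smooth."""
    selected = list(selected)
    remaining = [r for r in all_rows if r not in selected]
    while remaining:
        closest = min(
            remaining, key=lambda r: distance_to_center(matrix, r, selected, cols)
        )
        if not is_smooth(matrix, selected + [closest], cols, delta):
            break  # assumption: stop at the first row that breaks smoothness
        selected.append(closest)
        remaining.remove(closest)
    return selected


# Hypothetical expression matrix: three similar genes and one very different gene.
matrix = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [9.0, 9.0]]
cols = [0, 1]
cluster = top_down(matrix, [0, 1, 2, 3], cols, delta=0.5)
extended = bottom_up(matrix, [0], [0, 1, 2, 3], cols, delta=0.5)
print(cluster, sorted(extended))
```

With a single remaining row the stand-in score is zero, so the top-down loop always terminates. Swapping in a different score only requires changing `is_smooth`.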
58
Algorithm (Finding a given number of clusters)
Input: A gene expression matrix A with gene set X and experiment set Y, a subset J ⊆ Y, a number δ > 0, and n, the number of δ-smooth clusters to be found.
Output: A set CS of n δ-smooth clusters under the conditions in J.
I = X; CS = ∅ /* Output cluster set */
Repeat n times
   Run the Top-Down Algorithm on the set I of unselected genes to get a δ-smooth cluster C.
   Apply Bottom-Up on X to extend C.
   CS = CS ∪ {C}; I = I − C.
End repeat.
59
Experiments with the Yeast Data
The Data Set (Tavazoie et al. 99): 2884 genes, 17 conditions.
Experiments with:
-- K-means Algorithm with k = 30, Pearson coefficient.
-- Our Algorithm 1 with smooth score δ = 50, which output over a hundred 50-smooth clusters.
60
Clusters from our algorithm have strong patterns.
The 22nd cluster from K-means corresponds roughly to 12 smooth clusters from our algorithm.
61
More Smooth Clusters
62
Characterization of Our Algorithms
Our algorithm first clusters all low-fluctuation genes or noise into one or two clusters, while the K-means algorithm assigns these genes to many clusters.
[Figure: first two clusters from our algorithm]
63
Comparison with functional categories
Our approach is systematic and blind to knowledge of yeast. However, there is significant grouping of genes within the same functional category in many of the discovered smooth clusters.
64
Performance evaluation: Test on DLBCL case
