Statistical Analysis of DNA Microarray. - PowerPoint PPT Presentation

Description:

Title: An example of HDLSS: Microarray data Author: rizem Last modified by: teaching Created Date: 5/4/2001 2:08:15 AM Document presentation format – PowerPoint PPT presentation



1
Statistical Analysis of DNA Microarray.
  • An Example of HDLSS in Genetics.

2
The Data
3
Expression Matrix
  • Rows represent genes (feature vectors).
  • Columns represent different cell samples, e.g.
    cancer cells from different patients.
  • Each element (i,j) of the array represents the
    expression level of gene i in cell sample j.
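
As a minimal sketch of this layout, the expression matrix can be held in a NumPy array. The values below are made up for illustration only, and the indices are 0-based, unlike the (i,j) notation on the slide.

```python
import numpy as np

# Hypothetical toy data: 4 genes (rows) x 3 cell samples (columns).
# The numbers are invented purely to illustrate the layout.
expression = np.array([
    [ 1.2, -0.3,  0.8],
    [ 0.1,  0.0, -0.2],
    [-1.5,  2.1,  0.4],
    [ 0.7,  0.9, -1.1],
])

n_genes, n_samples = expression.shape

gene_1_profile = expression[1, :]    # feature vector of the gene in row 1
sample_2_profile = expression[:, 2]  # all genes measured in the sample in column 2
level = expression[2, 1]             # expression level of gene 2 in sample 1
```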

4
Goal of Analysis of Expression Matrix
  • Statistical methods are applied to:
  • Group similar genes together, yielding groups of
    functionally similar genes.
  • Extract representative gene in each group.
  • Group similar cell samples together.

5
Overview: DNA Microarray Technology
  • One cell sample.
  • Level of expression.
  • Microarray technique.

6
Getting the Data... One Cell Sample at a Time

7
Getting the Data: Measuring the Level of
Expression Gene by Gene
  • Each spot in this DNA microarray represents the
    level of expression of a single gene in the tumor
    cell compared to a reference cell.
  • Standardize the level of expression of this cell
    to make it comparable to other cells.

Spot legend: expressed in reference cell;
expressed in both reference and tumor cell;
expressed in tumor cell; not expressed.
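
The slides do not say how the comparison to the reference cell or the standardization is computed. One common convention, assumed here and not taken from the presentation, is to take the log2 ratio of tumor to reference intensity for each spot and then center and scale the values of each array. The function names and intensities below are hypothetical.

```python
import numpy as np

def log_ratio(tumor, reference):
    """log2(tumor / reference) per spot; 0 means equal expression."""
    tumor = np.asarray(tumor, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.log2(tumor / reference)

def standardize(values):
    """Center to mean 0 and scale to unit standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up spot intensities for four genes on one array.
tumor = [800.0, 200.0, 400.0, 100.0]
reference = [200.0, 200.0, 100.0, 400.0]

ratios = log_ratio(tumor, reference)  # positive: more expressed in tumor
z = standardize(ratios)               # comparable across arrays
```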
8
Level of Expression: mRNA
9
Level of Expression: mRNA
  • All cells contain the same DNA (the same genes),
    but in any one cell not all genes are active.
  • What differentiates cells is which genes are
    active, i.e. expressed.
  • To measure a cell's expression, we measure its
    messenger RNA, denoted mRNA.

10
Measuring the Level of Expression: Complementary
Strands
11
mRNA and DNA
  • mRNA is a single-stranded copy of a piece of DNA.
  • mRNA is highly unstable.
  • DNA is double-stranded, one strand complementary
    to the other.
  • DNA is stable.

12
Getting One Sample: Microarray Technique
13
Microarray Technique (Cont.): The Microarray
Microarrays are made from a collection of
purified DNAs. A drop of each type of DNA in
solution is placed onto a specially-prepared
glass microscope slide by an arraying machine.
The arraying machine can quickly produce a
regular grid of thousands of spots in a square
about 2 cm on a side, small enough to fit under a
standard slide coverslip. The DNA in the spots is
bonded to the glass to keep it from washing off
during the hybridization reaction.
14
Microarray Technique (Cont.): Description of the
Method
  • Definition of microarray from the National Human
    Genome Research Institute:
  • The method uses a robot to precisely apply
    droplets containing functional DNA to glass
    slides. Researchers then attach fluorescent
    labels to DNA from the cell they are studying.
    The labeled probes are allowed to bind to
    complementary DNA strands on the slides. The
    slides are put into a scanning microscope that
    can measure the brightness of each fluorescent
    dot; brightness reveals how much of a specific
    DNA fragment is present, an indicator of how
    active it is.

15
Microarray Technique (Cont.): The Method Step by
Step
  • First step: to measure the gene expression level
    of a cell, collect mRNA from the cell of
    interest, usually a cancer cell, and the same
    quantity of mRNA from a reference cell.
  • Second step: convert mRNA to cDNA.
  • mRNA is highly unstable; to stabilize it, we
    complement the strand and create cDNA
    (complementary DNA).
  • Third step: create cDNA probes.
  • Label the cDNA from each cell with fluorescent
    dyes. A differently colored fluor is used for
    each sample.

16
Microarray Technique: The Method Step by Step
(Cont.)
  • Fourth step: hybridize the cDNA probes from the
    two samples to the microarray. Once the cDNA
    probes have been hybridized to the array and any
    loose probe has been washed off, the array must
    be scanned to determine how much of each probe is
    bound to each spot.

17
Statistical Methods
  • Clustering.
  • Gene shaving algorithm: use of PCA for
    clustering.

18
Clustering Overview
  • K-means clustering.
  • Hierarchical clustering.
  • Validation method.
19
What Is Clustering?
  • For a sample of size n described in a
    d-dimensional feature space, clustering is a
    procedure that:
  • Divides the d-dimensional feature space
    into k disjoint groups.
  • Ensures that data points within each group are
    more similar to each other than to any data
    point in other groups.

Illustration for n = 45, d = 2 and k = 3.
20
Similarity Between Feature Vectors
  • The choice of similarity function depends on the
    data. For example, if the data are invariant
    under linear transformation or rotation, then the
    similarity function has to be invariant too. The
    similarity function could be a distance or an
    inner product.
  • Examples of similarity functions:
  • Euclidean distance, used in the illustrations
    for d = 2.
  • Correlation, used for microarray data.
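
The two similarity functions named above can be sketched as follows. The slides do not specify which correlation is meant; Pearson correlation is assumed here, and the function names and toy vectors are illustrative only.

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two feature vectors (0 means identical)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def pearson_correlation(x, y):
    """Pearson correlation: inner product of the centered, normalized vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Two made-up profiles with the same shape but a 10x difference in scale.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

d_ab = euclidean_distance(a, b)   # large: the profiles are far apart
r_ab = pearson_correlation(a, b)  # maximal: the profiles rise and fall together
```

The example illustrates the invariance point: correlation ignores shifts and positive rescaling of a profile, while Euclidean distance does not.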

21
K-means Clustering
  • Divide the d-dimensional feature space into k
    parts described by the Voronoi partition of the
    k mean vectors.
  • The algorithm finds the mean vector of each
    cluster.

Illustration for d = 2 and k = 3: red points
represent the means of the clusters and red lines
represent the Voronoi partition.
22
Algorithm for K-means Clustering
  • Algorithm:
  • begin initialize n, k, m1, m2, ..., mk
  • do classify the n samples according to the
    nearest mi
  • recompute mi
  • until no change in mi
  • return m1, m2, ..., mk
  • end
  • Computational complexity: O(ndkT), where T is the
    number of iterations.

For d = 2, illustration of the trajectories of
the 3 means.
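
The pseudocode above can be sketched in NumPy as follows. The initialization (k distinct samples chosen at random), the `max_iter` cap and the handling of empty clusters are assumptions of this sketch; the slide does not specify them. `max_iter` is expected to be at least 1.

```python
import numpy as np

def k_means(samples, k, max_iter=100, seed=0):
    """Return (means, labels) for a simple k-means on an (n, d) array."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize m1..mk with k distinct samples (an assumed, common choice).
    means = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(max_iter):
        # Classify: squared distance from every sample to every mean.
        dists = ((samples[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster is empty.
        new_means = np.array([
            samples[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
            for i in range(k)
        ])
        if np.allclose(new_means, means):  # until no change in the means
            break
        means = new_means
    return means, labels

# Tiny one-dimensional example with two obvious groups.
points = np.array([[0.0], [1.0], [10.0], [11.0]])
means, labels = k_means(points, k=2)
```

Each pass costs O(ndk) for the distance computation, which matches the O(ndkT) complexity quoted on the slide. In general k-means can stop at a local optimum that depends on the initial means.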
23
K-means Clustering for Microarray Data
  • Cf. picture k.mean.
  • K-means clustering of lymphoma data. Lymphoma
    profiles were clustered using the expression of
    148 germinal-center-specific genes and the
    Euclidean distance metric. (a) represents the
    germinal-cell subtype and (b) represents the
    activated subtype. Each column represents a
    specific gene and each row a specific cancer
    profile.

24
Hierarchical Clustering
Venn Diagram of Clustered Data
Dendrogram
25
Hierarchical Clustering (Cont.)
  • Multilevel clustering, at level 1 we have n
    clusters and at level n we have one cluster.
  • Agglomerative HC starts with singletons and
    merges clusters.
  • Divisive HC starts with one cluster containing
    all samples and splits clusters.

26
Hierarchical Clustering: Nearest Neighbor
Algorithm
  • Nearest Neighbor Algorithm is an agglomerative HC
    (bottom-up).
  • The algorithm starts with n nodes (n is the size
    of our sample). At every level the 2 most
    similar nodes are merged together into one node.
    The algorithm stops when we get the desired
    number of clusters.
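
A naive sketch of this bottom-up procedure is given below. It assumes Euclidean distance and measures the similarity of two nodes by their closest pair of samples (the nearest-neighbor, or single-linkage, rule); the function name and the toy points are illustrative.

```python
import numpy as np

def nearest_neighbor_clustering(samples, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    samples = np.asarray(samples, dtype=float)
    # Start: every sample is its own node (a list of sample indices).
    clusters = [[i] for i in range(len(samples))]
    # Pairwise Euclidean distances between all samples.
    diff = samples[:, None, :] - samples[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Distance between clusters = distance of their closest pair.
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the two nodes
        del clusters[b]
    return clusters

points = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
result = nearest_neighbor_clustering(points, n_clusters=3)
```

This version rescans all cluster pairs at every level, which is simple but slow for large n.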

27
Nearest Neighbor, data to cluster.
28
Nearest Neighbor, Level 2, k = 7 clusters.
29
Nearest Neighbor, Level 3, k = 6 clusters.
30
Nearest Neighbor, Level 4, k = 5 clusters.
31
Nearest Neighbor, Level 5, k = 4 clusters.
32
Nearest Neighbor, Level 6, k = 3 clusters.
33
Nearest Neighbor, Level 7, k = 2 clusters.
34
Nearest Neighbor, Level 8, k = 1 cluster.
35
(No Transcript)
36
Results of Hierarchical Clustering on Microarray
Data
  • Grouping functionally similar genes.
  • Grouping similar cell samples.
  • Cf. picture in Perou.trend.review2001.pdf, page 6.

37
Criterion Function for Clustering
  • Criterion functions depend on the grouping and
    the number of clusters. Examples are:
  • Sum of squared errors:
    $J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \| x - m_i \|^2$,
    where $D_i$ is the set of samples in cluster i
    and $m_i$ is its mean.
  • Scatter criteria: $S_W / S_T$, where
    $S_T = S_W + S_B$,
  • i.e. decompose the total scatter matrix into
    between-cluster scatter matrix and within-cluster
    scatter matrix.
  • The best clustering minimizes the criterion.
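
The scatter decomposition can be checked numerically with the sketch below. The function name and toy data are illustrative. The slides do not say how the matrix ratio is reduced to a single number; taking traces is one assumed option, and the trace of the within-cluster scatter matrix equals the sum of squared errors.

```python
import numpy as np

def scatter_matrices(samples, labels):
    """Return (S_W, S_B, S_T) for an (n, d) array and n cluster labels."""
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    total_mean = samples.mean(axis=0)
    d = samples.shape[1]
    s_w = np.zeros((d, d))
    s_b = np.zeros((d, d))
    for label in np.unique(labels):
        members = samples[labels == label]
        m_i = members.mean(axis=0)
        centered = members - m_i
        s_w += centered.T @ centered             # within-cluster scatter
        shift = (m_i - total_mean)[:, None]
        s_b += len(members) * (shift @ shift.T)  # between-cluster scatter
    centered_all = samples - total_mean
    s_t = centered_all.T @ centered_all          # total scatter
    return s_w, s_b, s_t

x = [[0, 0], [0, 2], [4, 0], [4, 2]]
s_w, s_b, s_t = scatter_matrices(x, labels=[0, 0, 1, 1])

sse = np.trace(s_w)                      # sum of squared errors
ratio = np.trace(s_w) / np.trace(s_t)    # assumed scalar form of S_W / S_T
```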

38
Gene Shaving
  • The gene shaving method is also a method of
    clustering genes and sample cells. But unlike
    classic clustering, in this method one gene could
    belong to more than one cluster.

39
Gene Shaving Iteration
40
Gene Shaving: Iteration
  • 1. Start with the entire expression matrix X,
    each row centered to have zero mean.
  • 2. Compute the leading PC of the rows of X.
  • 3. Shave off the proportion alpha (10%) of the
    genes having the smallest absolute inner product
    with the leading PC.
  • 4. Repeat steps 2 and 3 until only one gene
    remains.
  • 5. This produces a nested sequence of gene
    clusters
    $S_n \supset \dots \supset S_k \supset \dots \supset S_1$,
    where $S_k$ denotes a cluster of k genes.
    Estimate the optimal cluster size k using the
    gap statistic.
  • 6. Orthogonalize each row of X with respect to
    $\bar{x}_{S_k}$, the average gene in $S_k$, for
    the optimal k from step 5.
  • 7. Repeat steps 1-5 with the orthogonalized data
    to find the second optimal cluster. This process
    continues until a maximum of M clusters is found.
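
Steps 1-4 and step 6 can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the leading PC is taken as the first right singular vector of the row-centered matrix, at least one gene is shaved per pass, and the function names are hypothetical. Choosing the optimal k (step 5) is not included.

```python
import numpy as np

def shaving_sequence(X, alpha=0.10):
    """Return the nested gene-index sets S_n, ..., S_1 from one shaving run."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=1, keepdims=True)     # step 1: center each row
    genes = np.arange(X.shape[0])             # indices of surviving genes
    sequence = [genes.copy()]
    while len(genes) > 1:                     # step 4: repeat until one gene
        sub = X[genes]
        # Step 2: leading PC of the rows, here the first right singular vector.
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        scores = np.abs(sub @ vt[0])          # |inner product| per gene
        # Step 3: shave a proportion alpha, at least one gene, never all.
        n_drop = min(max(1, int(alpha * len(genes))), len(genes) - 1)
        keep = np.argsort(scores)[n_drop:]
        genes = genes[np.sort(keep)]
        sequence.append(genes.copy())
    return sequence

def orthogonalize(X, cluster):
    """Step 6: remove from each row its component along the cluster average."""
    X = np.asarray(X, dtype=float)
    avg = X[cluster].mean(axis=0)
    unit = avg / np.linalg.norm(avg)
    return X - np.outer(X @ unit, unit)
```

A second cluster would then be sought by running `shaving_sequence` on the output of `orthogonalize`, as in step 7.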

41
To Estimate Cluster Size: Gap Estimate
  • For cluster $S_k$, let $D_k$ be the scatter
    estimate, i.e. $D_k = 100 \cdot S_B / S_T$.
  • For $b = 1, \dots, B$, let:
  • $X^{(b)}$ be the permuted data matrix (permuting
    the elements within each row of X).
  • $D_k^{(b)}$ be the scatter estimate for cluster
    $S_k^{(b)}$.
  • $\bar{D}_k$ be the mean of the $D_k^{(b)}$.
  • $\mathrm{Gap}(k) = D_k - \bar{D}_k$.
  • Choose the k that produces the largest gap.
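
The gap computation can be sketched as follows. Two assumptions are made here: the scatter estimate is read as the percent of variance explained by the cluster average (between variance over total variance), and the cluster finder is passed in as a function. `top_k_clusters` is only a simple stand-in so that the sketch runs on its own; it is not gene shaving. All function names are hypothetical.

```python
import numpy as np

def percent_variance(X, cluster):
    """D_k = 100 * between variance / total variance for the genes in cluster."""
    sub = np.asarray(X, dtype=float)[cluster]
    avg = sub.mean(axis=0)                       # average gene of the cluster
    between = np.mean((avg - sub.mean()) ** 2)
    total = np.mean((sub - sub.mean()) ** 2)
    return 100.0 * between / total

def permute_rows(X, rng):
    """Independently permute the elements within each row of X."""
    return np.array([rng.permutation(row) for row in X])

def top_k_clusters(X, sizes=(2, 5, 10)):
    """Stand-in cluster finder: the k genes best aligned with the leading PC."""
    Xc = X - X.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    order = np.argsort(-np.abs(Xc @ vt[0]))
    return {k: order[:k] for k in sizes}

def gap_curve(X, find_clusters, B=20, seed=0):
    """Gap(k) = D_k minus the mean of D_k over B row-permuted copies of X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    observed = {k: percent_variance(X, s) for k, s in find_clusters(X).items()}
    permuted = {k: [] for k in observed}
    for _ in range(B):
        Xb = permute_rows(X, rng)
        for k, s in find_clusters(Xb).items():
            permuted[k].append(percent_variance(Xb, s))
    return {k: observed[k] - float(np.mean(permuted[k])) for k in observed}
```

The chosen cluster size would then be `max(gaps, key=gaps.get)` for `gaps = gap_curve(X, find_clusters)`.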

42
Gene Shaving (Cont.)
The first three gene clusters found for the DLCL
data
43
Gene Shaving (Cont.)
Percent of gene variance explained by first j
gene shaving column averages (j = 1, 2, ..., 10)
(solid curve), and by first j principal
components (broken curve). For the shaving
results, the total number of genes in the first j
clusters is also indicated.
44
Gene Shaving (Cont.)
(a) Variance plots for real and randomized data.
The percent variance explained by each cluster,
both for the original data, and for an average
over three randomized versions. (b) Gap estimates
of cluster size. The gap curve, which highlights
the difference between the pair of curves, is
shown.
45
References
  • Pattern Classification. Richard O. Duda, Peter
    E. Hart and David G. Stork. Chapter 10.
  • Gene shaving as a method for identifying
    distinct sets of genes with similar expression
    patterns. T. Hastie, R. Tibshirani, M.B. Eisen,
    A. Alizadeh, R. Levy, L. Staudt, W.C. Chan,
    D. Botstein and P. Brown. Genome Biology 2000.
    http://genomebiology.com/2000/1/2/research/0003/B
    14.
  • Cluster analysis and display of genome-wide
    expression patterns. PNAS (1998).

46
References
  • Basic microarray analysis: grouping and feature
    reduction. S. Raychaudhuri, P. Sutphin, J.T. Chang
    and Russ B. Trends in Biotechnology 2001.
  • Tumor classification using gene expression
    patterns from DNA microarrays. Charles M. Perou,
    Patrick O. Brown and David Botstein. Trends in
    Molecular Medicine, December 2000.
  • Pictures and definition of microarray technology
    from the National Human Genome Research Institute.