Presentazione di PowerPoint - PowerPoint PPT Presentation

Provided by: gall92
1
Observatory of Complex Systems
http://lagash.dft.unipa.it
L3 - Microarray Experiments
Salvatore Miccichè
Università degli Studi di Palermo, Dipartimento di Fisica e Tecnologie Relative
Doctoral School in Applied Physics - XX cycle
Bioinformatics Course, Università degli Studi di Palermo - 22 May 2006
2
1) Basic Introduction to Microarray Experiments
- what they are
- getting the results from the experiment
2) Statistical Analysis of Microarray Data
- statistical analysis
- Gene Ontology
3) Results
- Microarray and Cancer
- Microarray and microRNA
[1] P. Baldi, G. W. Hatfield, DNA Microarrays and Gene Expression, Cambridge University Press, Cambridge, UK (2002), ISBN 0 521 80022 6
[2] A.D. Baxevanis, B.F.F. Ouellette, Bioinformatics, J. Wiley & Sons, New Jersey (2005), ISBN 0 471 47878 4
3
THEME 1: What Microarray experiments are
1) Basic Introduction to Microarray Experiments
- what they are
- getting the results from the experiment
[1] P. Baldi, G. W. Hatfield, DNA Microarrays and Gene Expression, Cambridge University Press, Cambridge, UK (2002), ISBN 0 521 80022 6
4
1.1 Microarray Experiments
1.1.1 Motivations
Array technologies monitor the combinatorial interaction of a set of DNA fragments/proteins with a predetermined library of molecular probes. The currently most advanced of these technologies is the use of DNA arrays, also called DNA chips, for simultaneously measuring the level of the mRNA gene products of a living cell. [1]
5
1.1 Microarray Experiments
1.1.2 How the experiment works
Every spot represents a different coding sequence from a different gene.
Example: 3 out of many mRNAs, developed in two different environments (with and without oxygen).
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
6
1.1 Microarray Experiments
1.1.2 How the experiment works
Some genes are expressed both in the presence and in the absence of oxygen.
There is a logical connection between the function of a gene and its pattern of expression.
[3] D.T. Ross et al., Systematic variation in gene expression patterns in human cancer cell lines, Nature Genetics, 24, 227 (2000)
7
1.2 Microarray Experiments
1.2.1 The red/green intensities
The Red/Green intensities of each spot are computed by (i) measuring the mean fluorescence emitted by the spot and (ii) subtracting the mean Red/Green fluorescence emitted by a small area surrounding the spot:
$R = R_{spot} - R_{background}$
$G = G_{spot} - G_{background}$
Fluorescent signals normal
Fluorescent signals inverted
To minimize the possibility that 60-mers match differently with the two dyes
Number of spots: 43931
Number of 60-words in the chip: 41000
check-spot 31419422256
675 60-words are repeated in different spots.
Number of distinct genes in the chip: 18637
8
1.2 Microarray Experiments
1.2.2 The MA-plot
The horizontal axis shows $A = \frac{1}{2}\log_2(R\,G)$, the mean log-intensity.
The vertical axis shows $M = \log_2(R/G)$, the log-ratio.
The bending in the lower-left corner, i.e. at low intensity values, is usually due to the poor resolution of the photomultipliers.
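As a minimal sketch of how the two plotted quantities can be obtained from the raw readings, the snippet below subtracts the local background from each spot and converts R and G into M and A. It assumes the usual MA-plot definitions with base-2 logarithms, M = log2(R/G) and A = (1/2) log2(R G), which agree with the rule used later in these slides (M > 1 corresponds to R/G > 2). The function name and the intensity values are hypothetical.

```python
import math

def ma_values(r_spot, r_background, g_spot, g_background):
    """Return (M, A) for one spot from mean spot and background fluorescence."""
    r = r_spot - r_background  # background-corrected red intensity
    g = g_spot - g_background  # background-corrected green intensity
    if r <= 0 or g <= 0:
        raise ValueError("background-corrected intensities must be positive")
    m = math.log2(r / g)        # log-ratio (vertical axis)
    a = 0.5 * math.log2(r * g)  # mean log-intensity (horizontal axis)
    return m, a

# Hypothetical readings for a single spot.
m, a = ma_values(r_spot=1100.0, r_background=76.0,
                 g_spot=300.0, g_background=44.0)
print(m, a)
```

Spots whose background exceeds the signal have no defined logarithm; here they raise an error, but flagging and discarding them would be an equally reasonable choice.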
9
1.2 Microarray Experiments
1.2.3 The normalization procedure
The normalization is performed by using the global LOWESS (LOcally WEighted Scatter-plot Smoothing) algorithm.
As a result, one obtains normalized values of M and A, and therefore also normalized values of R and G (inverting $M = \log_2(R/G)$ and $A = \frac{1}{2}\log_2(R\,G)$):
$R_n = 2^{A_n + M_n/2}$
$G_n = 2^{A_n - M_n/2}$
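The normalization step can be sketched as follows. This is a simplified stand-in for LOWESS, not the exact procedure used for these data: it fits a tricube-weighted local straight line of M against A at every spot, without the robustness iterations of the full algorithm, and subtracts the fitted trend from M. Only M is corrected here and A is left unchanged, which is an assumption. The function names, the `frac` parameter and the synthetic data are hypothetical.

```python
def lowess_trend(a, m, frac=0.5):
    """Simplified LOWESS: tricube-weighted local linear fit of m on a."""
    n = len(a)
    k = min(n, max(2, int(round(frac * n))))  # points used in each local fit
    trend = []
    for x0 in a:
        # Bandwidth: distance from x0 to its k-th nearest neighbour.
        h = sorted(abs(x - x0) for x in a)[k - 1] or 1e-12
        w = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        xm = sum(wi * x for wi, x in zip(w, a)) / sw
        ym = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - xm) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - xm) * (y - ym) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        trend.append(ym + slope * (x0 - xm))
    return trend

def normalize_m(a, m, frac=0.5):
    """Subtract the intensity-dependent trend from the log-ratios."""
    trend = lowess_trend(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

def to_intensities(a, m):
    """Invert M = log2(R/G), A = 0.5*log2(R*G) to recover R and G."""
    r = [2.0 ** (ai + mi / 2.0) for ai, mi in zip(a, m)]
    g = [2.0 ** (ai - mi / 2.0) for ai, mi in zip(a, m)]
    return r, g

# Synthetic data: a purely intensity-dependent bias in M.
a_values = [6.0 + 0.5 * i for i in range(20)]
m_values = [0.3 * x - 1.0 for x in a_values]
m_norm = normalize_m(a_values, m_values)
r_norm, g_norm = to_intensities(a_values, m_norm)
print(max(abs(v) for v in m_norm))
```

On real chips the trend is curved rather than linear, and the choice of `frac` controls how closely the fit follows that curve.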
10
1.2 Microarray Experiments
1.2.4 The ratio over/under-expressed genes
Since the normalized values of M (log-ratio) are distributed around the M = 0 line, we can easily check which gene is over/under-expressed.
One can set a threshold and select the genes that are over/under-expressed with respect to it.
Threshold = 1:
$M > 1 \Rightarrow R/G > 2$
$M < -1 \Rightarrow R/G < 0.5$
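The threshold rule can be sketched in a few lines, assuming M is a base-2 log-ratio so that a threshold of 1 corresponds to a two-fold change. The gene names and M values are hypothetical.

```python
def classify(m_values, threshold=1.0):
    """Label each gene as over-, under-expressed or unchanged from its M value."""
    labels = {}
    for gene, m in m_values.items():
        if m > threshold:
            labels[gene] = "over"       # R/G > 2**threshold
        elif m < -threshold:
            labels[gene] = "under"      # R/G < 2**(-threshold)
        else:
            labels[gene] = "unchanged"
    return labels

# Hypothetical normalized log-ratios.
normalized_m = {"geneA": 1.8, "geneB": -0.2, "geneC": -1.3}
labels = classify(normalized_m, threshold=1.0)
print(labels)
```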
11
THEME 2: Interpretation of Microarray experiments data
2) Statistical Analysis of Microarray Data
- statistical analysis
- Gene Ontology
[1] P. Baldi, G. W. Hatfield, DNA Microarrays and Gene Expression, Cambridge University Press, Cambridge, UK (2002), ISBN 0 521 80022 6
[2] A.D. Baxevanis, B.F.F. Ouellette, Bioinformatics, J. Wiley & Sons, New Jersey (2005), ISBN 0 471 47878 4
12
2.1 Statistical Analysis
2.1.1 The problem
Although many data analysis techniques have been applied to DNA array data, the field is still evolving and the methods have not yet reached a level of maturity. Even very basic issues of signal-to-noise ratio are still being sorted out. ... [1]
Level 1 - single genes: one seeks to establish whether each gene behaves differently in a control case versus a treatment situation.
Level 2a - multiple genes: clusters of genes are looked for. They are studied with the aim of finding co-regulation, interactions, common functionalities.
Level 2b - multiple genes: networks of genes/proteins.
13
2.1 Statistical Analysis
2.1.1 The problem
Suppose we have N genes and M measurements (experiments).
Consider the gene X and the M measurements of this gene:
$x_1, \dots, x_M$ (treatment)
$y_1, \dots, y_M$ (control)
Notice that M can be very small, even M = 1.
14
2.1 Statistical Analysis
2.1.2 Level 1 analysis: levels of expression
One wants to understand whether the level of expression is significantly different in the control and treatment regimes.
APPROACH 1 - fix a threshold for the ratio of intensities and check which gene is over/under-expressed.
This treats high-level intensities and low-level intensities in the same way.
APPROACH 2 - T-test
$m_c$ = mean of control
$m_t$ = mean of treatment
$s_c$ = standard deviation of control
$s_t$ = standard deviation of treatment
Here the problem is the statistical significance of the mean and standard deviation estimates: usually M is very small!
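The t-test formula itself did not survive in this transcript. As a sketch under that caveat, the snippet below uses the unequal-variance (Welch) form of the statistic, t = (m_c - m_t) / sqrt(s_c^2/n_c + s_t^2/n_t), built from the four quantities listed above. The sign convention, the function name and the sample values are assumptions.

```python
import math
from statistics import mean, stdev

def t_statistic(control, treatment):
    """Welch-type t statistic from the means and standard deviations."""
    m_c, m_t = mean(control), mean(treatment)
    s_c, s_t = stdev(control), stdev(treatment)  # needs at least 2 values each
    n_c, n_t = len(control), len(treatment)
    return (m_c - m_t) / math.sqrt(s_c ** 2 / n_c + s_t ** 2 / n_t)

# Hypothetical expression levels of one gene, three replicates per condition.
control = [1.0, 2.0, 3.0]
treatment = [4.0, 5.0, 6.0]
t = t_statistic(control, treatment)
print(t)
```

With a single measurement per condition the standard deviation is undefined and `stdev` raises an error, which is the small-M problem noted above.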
15
2.1 Statistical Analysis
2.1.2 Level 1 analysis: levels of expression
APPROACH 3 - T-test in the Bayesian Approach
Bayesian inference
Let us suppose that M is the model and D is the dataset.
Bayes Theorem
$P(M|D) = \frac{P(D|M)\,P(M)}{P(D)}$
P(M|D) is called the a posteriori probability: the probability of M given the data D. P(M) is called the a priori probability. P(D|M) is called the likelihood of the model.
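A small numerical illustration of Bayes theorem may help here. It is not taken from the slides: the two candidate models, their priors and the Gaussian likelihoods are invented for the example. It asks how probable it is that a gene has changed, given one observed log-ratio.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def posterior(prior, likelihood):
    """Bayes theorem over a finite set of models: P(M|D) = P(D|M) P(M) / P(D)."""
    joint = {model: likelihood[model] * prior[model] for model in prior}
    evidence = sum(joint.values())  # P(D), the normalizing constant
    return {model: value / evidence for model, value in joint.items()}

# Hypothetical setting: log-ratios scatter around 0 for unchanged genes
# and around 2 for changed genes, with the same spread.
observed_m = 1.5
prior = {"unchanged": 0.9, "changed": 0.1}
likelihood = {
    "unchanged": gaussian_pdf(observed_m, mean=0.0, sd=0.5),
    "changed": gaussian_pdf(observed_m, mean=2.0, sd=0.5),
}
post = posterior(prior, likelihood)
print(post)
```

Even with a prior that strongly favours "unchanged", the likelihood of the observation shifts most of the posterior weight to "changed", which shows how both P(M) and P(D|M) enter the result.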
We can use the bayesian approach and get the
probability that a given gene g is
16
2.1 Statistical Analysis
2.1.2 Level 1 analysis: levels of expression
We can use the Bayesian approach and get the probability P(M|D) that a given gene X is observed, given the fact that we observe a certain level of expression $x_i$ ($y_i$). This can be done for the treatment and the control case.
As in the example, a key role is played by the a priori probability P(M) and the likelihood P(D|M).
1) A minimal assumption is that
a) all data points are independent of each other,
b) the data of different genes are drawn from different Gaussian distributions,
which directly leads to the choice
$P(D|\mu, \sigma^2) = \prod_{i=1}^{M} \mathcal{N}(x_i; \mu, \sigma^2)$
2.1 Statistical Analysis
2.1.2 Level 1 analysis: levels of expression
2) The choice of the distribution P(M), i.e. the distribution of the parameters of the above Gaussian distributions, can be made in different ways. Another minimal assumption is the so-called conjugate prior choice, which again consists in choosing a Gaussian distribution for the mean parameter $\mu$ and an inverse gamma distribution for the variance parameter $\sigma^2$.
The above choices guarantee that an analytical, model-based closed form for $\mu$ and $\sigma^2$ can be found.
Having that, one can perform a t-test on these model-based estimates.
If M > 4, the two approaches give similar results. If M < 4, the Bayesian approach seems to work better.
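The closed-form estimates are not shown in this transcript. The sketch below assumes one commonly used form of the conjugate-prior variance estimate, sigma^2 = (nu0 * sigma0^2 + (n - 1) * s^2) / (nu0 + n - 2), where sigma0^2 is a background variance and nu0 weighs it like extra pseudo-observations, and keeps the sample mean as the estimate of the mean. The formula, the function names and the numbers are assumptions, not the slide's own content.

```python
import math
from statistics import mean, variance

def regularized_variance(values, prior_var, prior_df):
    """Variance estimate shrunk towards a background value prior_var."""
    n = len(values)
    s2 = variance(values) if n > 1 else 0.0  # sample variance, 0 if undefined
    return (prior_df * prior_var + (n - 1) * s2) / (prior_df + n - 2)

def regularized_t(control, treatment, prior_var, prior_df):
    """t statistic computed on the regularized variance estimates."""
    var_c = regularized_variance(control, prior_var, prior_df)
    var_t = regularized_variance(treatment, prior_var, prior_df)
    return (mean(control) - mean(treatment)) / math.sqrt(
        var_c / len(control) + var_t / len(treatment))

# Hypothetical replicates and background variance.
control = [1.0, 2.0, 3.0]
treatment = [4.0, 5.0, 6.0]
var_reg = regularized_variance(control, prior_var=4.0, prior_df=10)
t_reg = regularized_t(control, treatment, prior_var=4.0, prior_df=10)
print(var_reg, t_reg)
```

With few replicates the background variance dominates the estimate; as the number of measurements grows, the sample variance takes over and the result approaches the plain t-test, in line with the remark above about M > 4.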
18
2.1 Statistical Analysis
2.1.2 Level 2 analysis: multivariate analysis
Knowing that a gene's behaviour has changed between two situations is at best a first step. However, most genes ACT in CONCERT with other genes!
What DNA microarray experiments are really after are the PATTERNS of expression across multiple genes and experiments.
Therefore let us remove the assumption that genes are independent and look at their correlations. One can thus obtain clusters of (co-expressed) genes which are associated with the same pathway and are co-regulated.
Methodologies:
- Principal Component Analysis (PCA)
- Clustering methodologies: K-means, Hierarchical clustering, ...
19
2.1 Statistical Analysis
2.1.2 Level 2 analysis: multivariate analysis
APPROACH 1 - Principal Component Analysis
Suppose we have N genes whose expression is measured in M different experiments, at different times. Organize these data in an N×M matrix $D = (d_{ij})$ whose elements are the levels of expression of gene i in the j-th experiment.
Construct the N×N covariance matrix $\Sigma = E[D D^T]$.
Compute its eigenvalues $\lambda_k$ and eigenvectors $u_k$. Retain the p largest eigenvalues and the corresponding eigenvectors. (Project the measurement data onto the subspace generated by the first p eigenvectors.)
In microarray data the components with the largest eigenvalues may represent relevant expression patterns.
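The PCA steps above can be sketched without external libraries. This is an illustration only: it centres each gene's row, builds the N×N covariance matrix, and extracts the leading eigenpairs by power iteration with deflation, which is a substitute for a full eigen-decomposition routine. The function names, the random starting vector and the toy data are hypothetical.

```python
import math
import random

def center_rows(d):
    """Subtract from each gene (row) its mean over the experiments."""
    centered = []
    for row in d:
        mu = sum(row) / len(row)
        centered.append([x - mu for x in row])
    return centered

def covariance(centered):
    """N x N covariance matrix between the rows of a centred N x M matrix."""
    m = len(centered[0])
    return [[sum(x * y for x, y in zip(ri, rj)) / m for rj in centered]
            for ri in centered]

def matvec(s, u):
    return [sum(sij * uj for sij, uj in zip(row, u)) for row in s]

def top_eigenpairs(sigma, p, iters=200, seed=0):
    """Largest p eigenvalues and eigenvectors by power iteration with deflation."""
    rng = random.Random(seed)
    s = [row[:] for row in sigma]
    n = len(s)
    pairs = []
    for _ in range(p):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            v = matvec(s, u)
            norm = math.sqrt(sum(x * x for x in v))
            if norm < 1e-15:  # nothing left to extract
                break
            u = [x / norm for x in v]
        lam = sum(x * y for x, y in zip(u, matvec(s, u)))
        pairs.append((lam, u))
        # Deflation: remove the component just found.
        s = [[s[i][j] - lam * u[i] * u[j] for j in range(n)] for i in range(n)]
    return pairs

def project(centered, vectors):
    """Coordinates of each experiment (column) on the retained eigenvectors."""
    m = len(centered[0])
    return [[sum(u[i] * centered[i][j] for i in range(len(u))) for u in vectors]
            for j in range(m)]

# Toy data: 3 genes x 4 experiments; gene 2 is twice gene 1, gene 3 is flat.
data = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 1.0, 1.0, 1.0]]
centered = center_rows(data)
pairs = top_eigenpairs(covariance(centered), p=1)
lam, u = pairs[0]
scores = project(centered, [u])
print(lam, u)
```

In the toy data a single component carries all the variance, because the first two genes move together and the third does not move at all.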
20
2.1 Statistical Analysis
2.1.2 Level 2 analysis: multivariate analysis
APPROACH 2 - Clustering
Suppose we have N genes whose expression is measured in M different experiments, at different times. Organize these data in an N×M matrix $D = (d_{ij})$ whose elements are the levels of expression of gene i in the j-th experiment.
Find a similarity measure. An example is the correlation between levels of expression.
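As an example of such a similarity measure, the sketch below computes the Pearson correlation between the expression profiles of two genes and turns it into a distance as 1 - r. The conversion to a distance and the profiles are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined for a constant profile

def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(x, y)

# Hypothetical profiles over four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]
r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
print(r_ab, r_ac)
```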
K-means algorithm
1) Fix the number K of clusters. This could reflect the expected number of pathways involved in the experiment.
2) Select K centers, one for each cluster. At the first stage these are selected more or less at random. These points are called centroids.
21
2.1 Statistical Analysis
2.1.2 Level 2 analysis: multivariate analysis
3) All data points are assigned to the cluster associated with the closest representative (centroid).
4) After the assignments, new centroids are computed for each cluster. This is done by averaging, by taking the center of gravity of the cluster, ...
5) The procedure is iterated until fluctuations remain under a predetermined threshold.
The drawback of this procedure is that one needs to know the number of clusters in advance.
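The five K-means steps can be sketched as follows. This is a minimal illustration, not the implementation used for the results in these slides. It assumes Euclidean distance, picks the initial centroids at random among the data points, and uses the centre of gravity as the new centroid. The function name, the `tol` and `seed` parameters and the toy points are hypothetical.

```python
import math
import random

def kmeans(points, k, tol=1e-9, max_iter=100, seed=0):
    """Return (labels, centroids) after iterating assignment and update steps."""
    rng = random.Random(seed)
    # Step 2: initial centroids chosen at random among the data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 3: assign every point to the closest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Step 4: recompute each centroid as the centre of gravity of its cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centre
        # Step 5: stop when the centroids no longer move appreciably.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Toy data: two well separated groups of expression profiles (step 1: K = 2).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbering depends on the random initialization; only the grouping of the points is meaningful.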
Hierarchical clustering
1a) Select the two elements with the highest similarity/smallest distance. These two elements are joined together in a NODE.
1b) An expression pattern for the node is constructed by averaging the two original expression patterns.
22
2.1 Statistical Analysis
2.1.2 Level 2 analysis: multivariate analysis
2) A NEW, smaller (N-1)x(N-1) correlation/distance matrix is computed. The two joined elements are replaced by the node, and the distance of each remaining element from the node is re-computed as the mean of its distances from the two joined elements.
3) The procedure is iterated N-1 times, until a single node remains.
The above procedure is the one used in Average Linkage Clustering Analysis. The case in which one re-computes the distances by using the shortest distance rather than the mean distance corresponds to Single Linkage Clustering Analysis.
The drawback of this procedure is that there is no obvious way to define clusters. In fact, the output of the above protocol is a dendrogram (binary tree) in which the branches are built from the connections determined between nodes as the algorithm progresses. Usually clusters are determined by cutting the branches of the tree at more or less arbitrary points.
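The agglomerative procedure can be sketched as below, working directly on a distance matrix. The distance from a new node to each remaining element is taken as the mean of the distances from the two joined elements ("average") or as the shorter of the two ("single"), as described above. The function name, the node numbering scheme and the toy distances are hypothetical.

```python
def hierarchical(dist, linkage="average"):
    """Return the list of merges (node_a, node_b, distance) of the dendrogram.

    dist is a symmetric N x N matrix; leaves are numbered 0..N-1 and every
    merge creates a new node numbered N, N+1, ...
    """
    n = len(dist)
    d = {i: {j: dist[i][j] for j in range(n) if j != i} for i in range(n)}
    combine = min if linkage == "single" else (lambda x, y: (x + y) / 2.0)
    merges = []
    next_id = n
    while len(d) > 1:
        # Step 1a: find the two closest elements and join them in a node.
        a, b = min(((i, j) for i in d for j in d[i] if i < j),
                   key=lambda pair: d[pair[0]][pair[1]])
        merges.append((a, b, d[a][b]))
        # Step 2: distances from the new node to all remaining elements.
        new_row = {k: combine(d[a][k], d[b][k]) for k in d if k not in (a, b)}
        del d[a]
        del d[b]
        for k in d:
            del d[k][a]
            del d[k][b]
            d[k][next_id] = new_row[k]
        d[next_id] = new_row
        next_id += 1
    return merges

# Toy data: four elements placed on a line at positions 0, 1, 5 and 7.
positions = [0.0, 1.0, 5.0, 7.0]
dist = [[abs(p - q) for q in positions] for p in positions]
average_merges = hierarchical(dist, linkage="average")
single_merges = hierarchical(dist, linkage="single")
print(average_merges)
print(single_merges)
```

Cutting the dendrogram at a chosen height amounts to keeping only the merges whose distance is below that height, which is where the arbitrariness mentioned above comes in.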
23
2.2 Gene Ontology
2.2.1 Gene Ontology
http://www.geneontology.org
The Gene Ontology project provides a controlled
vocabulary to describe gene and gene product
attributes in any organism. The Gene Ontology
(GO) project is a collaborative effort to address
the need for consistent descriptions of gene
products in different databases. The GO
collaborators are developing three structured,
controlled vocabularies (ontologies) that
describe gene products in terms of their
associated biological processes, cellular
components and molecular functions in a
species-independent manner.
24
2.2 Gene Ontology
2.2.1 Gene Ontology
25
2.2 Gene Ontology
2.2.1 Gene Ontology
26
2.2 Gene Ontology
2.2.2 KEGG
http://www.genome.jp/kegg/
KEGG is a suite of databases and associated
software, integrating our current knowledge on
molecular interaction networks in biological
processes (PATHWAY database), the information
about the universe of genes and proteins
(GENES/SSDB/KO databases), and the information
about the universe of chemical compounds and
reactions (COMPOUND/DRUG/GLYCAN/REACTION
databases). The current statistics of the KEGG databases are as follows: number of pathways 37,209 (PATHWAY database).
27
2.2 Gene Ontology
2.2.2 KEGG
28
THEME 3: Results
3) Results
- Microarray and Cancer
- Microarray and microRNA
[4] T.R. Golub et al., Science, 286, 531-537 (1999)
[5] G.A. Calin et al., PNAS, 101, 11755-11760 (2004)
29
3.1 Microarray and Cancer
paper
30
3.2 Microarray and MicroRNA
paper
paper2
31
The End