Functional Genomics and Gene Network Analysis - PowerPoint PPT Presentation

Title:

Functional Genomics and Gene Network Analysis

Slides: 60

Provided by: TAD62

Transcript and Presenter's Notes

Title: Functional Genomics and Gene Network Analysis

1
Functional Genomics and Gene Network Analysis

Alexandra Maertens
Oct. 5, 2004

2
Functional Genomics

The fundamental strategy of functional genomics
is to expand the scope of biological
investigation from studying single genes or
proteins to studying all genes or proteins at
once in a systematic fashion.
Functional genomics seeks to narrow the gap
between sequence and function and to yield new
insights into the behavior of biological
systems.
Paraphrased from
http://bip.weizmann.ac.il/mb/functional_genomics.html

3
How to determine functionally related genes?

40% of predicted genes in newly sequenced
genomes cannot be assigned a function based on
sequence similarity
Other techniques include comparing phylogenetic
profiles, looking for gene fusion events, etc.,
but all of these techniques are inexact

4
Does co-regulation imply functional similarity?

Guilt by association: genes sharing a common
pattern of expression in many different
experiments are likely to be involved in similar
processes
This technique has been used by Tavazoie et al.
(1999) to identify biologically significant
DNA-motifs in the promoter region of genes
clustered based on cell-cycle expression patterns
in yeast.
Limited applicability for determining functions
of genes in a single species

5
Co-Expression Can Be Caused By

Gene A regulates Gene B, or vice versa
Or, they are both regulated by a third gene, C
It can just represent a common response to the
environment of the cell
And lastly it can be an accident

6
Determining Co-Regulation in Large Scale Data Sets

Euclidean Distance
Pearson
Spearman

7
Euclidean Distance

Commonly used
Not practical for analyzing a large data set with
a diverse set of microarrays
Microarrays do not really measure an absolute
amount of mRNA; they should be considered as
measuring the relative amount of mRNA
There will be large differences in range of
values for two microarrays done at different
times

The Pearson Correlation treats the vectors as if
they were the same (unit) length, and is thus
insensitive to the amplitude of changes that may
be seen in the expression profiles.
Since Euclidean distance measures the absolute
distance between points in space, the Euclidean
distance thus takes into account both the
direction and the magnitude of the vectors.
From Stanford Microarray Database,
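
The contrast between the two measures can be seen in a small sketch. The two expression profiles below are made-up values (not from the slides), chosen so that one is simply ten times the other: the Pearson correlation is 1, while the Euclidean distance is large.

```python
import math

def euclidean(x, y):
    # Absolute distance between two points in expression space.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # Covariance of the two profiles divided by the product of their spreads.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical profiles: same shape, very different amplitude.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0 * v for v in gene_a]

dist = euclidean(gene_a, gene_b)  # sensitive to amplitude
r = pearson(gene_a, gene_b)       # insensitive to amplitude
print(f"Euclidean distance = {dist:.2f}, Pearson r = {r:.3f}")
```

Two genes that rise and fall together but were measured on arrays with different dynamic ranges would look far apart by Euclidean distance yet perfectly correlated by Pearson.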

9
Pearson's

How well does a linear function describe the
relationship between two variables?
Values range from -1 (perfect negative
correlation) through 0 (no linear correlation)
to 1 (perfect positive correlation)
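
For reference, the standard sample formula for the Pearson correlation between two expression profiles x and y measured over n arrays is given below. The symbols x_i, y_i and the means are notation introduced here, not taken from the slides.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```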

10
Pearson (courtesy of Hyperstat.com)
11
R = 0.778184
12
Spearman

from www.mathworld.com

13
Convert each expression value to its rank within
its column (from lowest to highest)
14

Calculate the difference between the ranks
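
The two steps above (rank each column, then take the rank differences) can be sketched as follows. This is a minimal version that assumes no tied values, in which case the Spearman coefficient reduces to 1 - 6*sum(d^2) / (n(n^2 - 1)); the expression values are made up for illustration.

```python
def ranks(values):
    # Rank 1 = lowest value. Assumes all values are distinct (no ties).
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    # Shortcut formula based on squared rank differences; valid without ties.
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical expression values for two genes over five arrays.
gene_a = [0.5, 1.2, 3.4, 2.0, 8.1]
gene_b = [10, 40, 90, 30, 500]

print(ranks(gene_a), ranks(gene_b), spearman(gene_a, gene_b))
```

Because only ranks are used, the large outlier in the second profile has no more influence than any other top-ranked value.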

15
Going From Distance Measurements to Networks

Why do biologists care about networks?

16
From Patrik D'haeseleer, Harvard University,
http://genetics.med.harvard.edu/patrik

17
Yeast Protein Interaction Network
Uetz, Schwikowski, Fields and co-workers; Ito and
co-workers
18
Image credit: U.S. Department of Energy
Genomics:GTL Program, http://doegenomestolife.org
19
A network is simply

A collection of nodes (vertices)
Connected by edges (links)

From Barabási, Nature Reviews Genetics, Vol. 5,
Feb. 2004
20
From the layout of the network, you can calculate
a lot of interesting properties.

Degree: the connectivity, k, which is the number
of links a node has to its neighbours
Degree distribution: P(k) gives the probability
that a node will have a given number k of links
(obtained by counting the number of nodes with
each value of k and dividing by the total
number of nodes)

From this, you can calculate whether the network
is scale-free
This means that P(k) is proportional to k^(-γ)
Usually γ is around 2 (typically between 2 and 3)
This means that the network is organized into a
hub and spoke system with a small number of
nodes having a large number of links
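
As a rough sketch of how P(k) is obtained from a network, the snippet below counts degrees from an edge list and normalises by the number of nodes. The toy hub-and-spoke edge list and the node names are invented for illustration; nodes with no links at all do not appear in an edge list and are therefore not counted here.

```python
from collections import Counter

def degree_distribution(edges):
    # Count the links touching each node (undirected edges).
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n_nodes = len(degree)
    # P(k) = (number of nodes with degree k) / (total number of nodes)
    counts = Counter(degree.values())
    return {k: c / n_nodes for k, c in sorted(counts.items())}

# Hypothetical toy network: one hub linked to five genes, plus one extra link.
edges = [("hub", f"g{i}") for i in range(1, 6)] + [("g1", "g2")]
p_k = degree_distribution(edges)
print(p_k)
```

On real data, one way to check for scale-free behaviour is to plot log P(k) against log k and see whether the points fall roughly on a straight line.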

If there are N nodes in a network, the number of
possible connections is on the order of N^2
(N(N-1)/2 undirected pairs)
Assuming a modest 6,000 genes, that is about 18
million pairs, still more than we can hope to
experimentally determine
However, using correlation between two genes (or
orthologues) across several different experiments
as a stand-in for links, we can use microarrays
to develop a basic network that can then be
refined
The more experiments, and the more types of
experiments, in which we find two genes
co-expressed, the more probable it is that they
are truly linked.
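
A minimal sketch of using correlation as a stand-in for links is shown below. The gene names, the expression values and the cutoff of 0.9 are all assumptions made for illustration; a real analysis would choose the cutoff from a significance test.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(profiles, cutoff):
    # Link every pair of genes whose correlation reaches the cutoff.
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= cutoff:
            edges.append((g1, g2, r))
    return edges

# Hypothetical profiles over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
edges = coexpression_edges(profiles, cutoff=0.9)
print(edges)

# Number of distinct gene pairs among 6,000 genes.
n_pairs = math.comb(6000, 2)
print(n_pairs)
```

Only the geneA/geneB pair passes the cutoff here; the pair count for 6,000 genes shows why every pair cannot be tested experimentally.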

23
Hierarchical clustering vs. non-hierarchical
clustering

Hierarchical clustering (various agglomerative
and divisive techniques) treats the data as if we
have no idea how many clusters there should be.

24
K-means clustering

The user decides in advance how many clusters
there will be
This is a good method if you know a priori how
many clusters you want (for example, with three
time points, you would use three different
clusters)
Alternatively, if you visually inspect the data
you can see if there appears to be a certain
number of clusters
If all else fails, there are computer programs
that calculate an ideal number of clusters out of
many possible values for k
The goal is to divide the objects into K clusters
such that some distance metric relative to the
centroids of the clusters is minimized

1. Initial reference vectors are assigned randomly
or according to previous knowledge
2. Assign each object to one of k clusters randomly
3. Calculate average expression vectors for each
cluster (as reference vectors) and the distance
between clusters
4. Iteratively move objects between clusters; the
objects stay in the new cluster when they are
closer to the new cluster than to the old
cluster.
5. Repeat steps 3-4 until convergence, i.e. moving
any more objects would increase intra-cluster
distances
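
The steps above can be sketched in a few lines. This is a minimal illustration rather than the exact procedure on the slide: it uses the common variant that picks k data points at random as the initial reference vectors, then alternates between reassigning points and recomputing centroids. The function names, the seed and the toy data are assumptions.

```python
import random

def mean_point(points):
    # Average expression vector (centroid) of a group of points.
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial reference vectors
    assignment = None
    for _ in range(max_iter):
        # Move each object to the cluster whose centroid is closest.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # no object moved: converged
            break
        assignment = new_assignment
        # Recompute the reference vector of every non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignment, centroids

# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

The result depends on the starting points, which is one reason the partition found is not guaranteed to be the best one; running several random starts and keeping the lowest total distance is a common safeguard.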

26
For every cluster k, sum the distances (usually
squared) between every data point Xn in the
cluster and the geometric centre (centroid, i.e.
the mean) of the cluster.
28
Finding the Center

so the new cluster centre for A is (19.666, 21)
This process is repeated until successive
iterations cause no overall change in the sum of
distances of each point in each cluster from the
center. (adapted from
http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/KMeans_eq_popup.htm)
29
http://www-2.cs.cmu.edu/awm/tutorials/kmeans10.pdf
32

Remember:
Clustering always works, in the sense that it
always returns a partition.
There is no guarantee it is the optimal
partition.

Analyze coexpression relationships of homologous
sets of genes in human, fly, worm, yeast to
identify conserved genetic modules
3182 microarrays over multiple groups of
homologous genes (metagenes)

Metagenes
Orthologous counterparts in different organisms
Best reciprocal BLAST hit
Each gene assigned to at least one metagene
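
The "best reciprocal BLAST hit" criterion can be sketched as follows, assuming the best hit of each gene in the other species has already been extracted from BLAST output into a dictionary. The gene identifiers and the two dictionaries are hypothetical; no BLAST run is performed here.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    # Keep a pair only if each gene is the other's best hit.
    pairs = []
    for gene_a, gene_b in best_a_to_b.items():
        if best_b_to_a.get(gene_b) == gene_a:
            pairs.append((gene_a, gene_b))
    return sorted(pairs)

# Hypothetical best hits between worm (w*) and fly (f*) genes.
worm_to_fly = {"w1": "f1", "w2": "f2", "w3": "f2"}
fly_to_worm = {"f1": "w1", "f2": "w3"}

pairs = reciprocal_best_hits(worm_to_fly, fly_to_worm)
print(pairs)
```

Here w2 is dropped because its best fly hit, f2, points back to w3 rather than to w2.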

35
BLAST

Stands for Basic Local Alignment Search Tool
Performs a pairwise alignment of two strings
(either nucleotides or amino acids)
Amino acids are scored according to their
similarity: chemical similarity and observed
mutation frequencies
The number of matches this good that would be
expected to occur by chance is given as an
E value

36
A Metagene
37
Global View of the Data Set
38

They then identified pairs of genes by
relabeling each gene in each species with its
metagene, and then comparing expression levels
across the different arrays. This provided a
Pearson correlation for each pair.

39
Are the data sufficient?

Divided the data set randomly in half
Used each half independently to generate a
network and compared this to the network
generated by the total data to see how many
interactions were maintained at the same level of
statistical significance
Only 40% of the interactions observed were
statistically significant in both halves

You can interpret this number two ways:
1) The approach is sensitive to the amount of
data used
2) However, even with only half of the data,
there was still some signal amidst the noise

41
How robust is the constructed network to noise?

They added increasing amounts of Gaussian noise
and found that the network was robust with
realistic levels of noise seen in microarray
experiments

Permuting your data preserves the same
distribution of your data, but without the labels
They claim that they randomly permuted gene
expression data. This is a bit ambiguous.
Probably means they shuffled the expression data
within the species column, so the consistency of
the metagenes would cease to hold throughout the
row
They repeated this shuffling 10 times. The point
of shuffling is to get a representative sample of
the ways you could rearrange your objects; 10 is
a conservative number for this instance.
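
The idea of shuffling to break the labels while keeping the distribution can be sketched as a simple permutation test on one pair of profiles. This is an illustration of the general technique, not a reconstruction of what the authors did; the data, the seed and the choice of 1,000 shuffles are assumptions.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def permutation_null(x, y, n_shuffles, seed=0):
    # Shuffle y so its values are kept but their pairing with x is lost.
    rng = random.Random(seed)
    shuffled = list(y)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null.append(pearson(x, shuffled))
    return null

# Hypothetical profiles of two co-expressed genes over eight arrays.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
gene_b = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.3, 7.9]

observed = pearson(gene_a, gene_b)
null = permutation_null(gene_a, gene_b, n_shuffles=1000)
# Fraction of shuffles at least as correlated as the real data (+1 correction).
p_value = (1 + sum(r >= observed for r in null)) / (1 + len(null))
print(observed, p_value)
```

With only 10 shuffles the smallest p-value this estimate can report is 1/11, which illustrates why the number of permutations matters.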

44
Using Several Different Species

By using more species, are you just providing
more data, or is there a benefit to looking at
conservation of coexpression across several
distantly related species?

45
Using Several Different Species
46
How to go from this data to a network?

For each metagene, m1, they ranked every other
metagene in each species by how well it
correlates with m1, and then compared this
across species, producing a rank ratio in each
species for each directed pair
They calculated a joint probability distribution
based on order statistics to determine if these
rankings were statistically significant.

They then used this p-value to do two things:
If the p-value was below a cutoff (corrected for
multiple tests), this defined a link
The strength of the p-value then provided a
distance metric that was then used to visualize
the grouping of the data
From visual inspection of this data, they decided
upon 12 clusters, and then performed k-means
clustering

49
Do these clusters effectively group functionally
related genes?

The Gene Ontology (or GO) terminology provides a
controlled vocabulary to describe protein and
gene function, cellular location and biological
process
You can calculate whether you have an
overrepresentation of certain GO terms in any
given cluster to characterize the cluster
The P-value is derived from a hypergeometric
distribution (sampling without replacement), as
the probability of x or more out of n genes
having a given annotation, given that G of N have
that annotation in the genome in general.
(Probably done by GOFinder)
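
The hypergeometric P-value described above can be computed directly from binomial coefficients. This is a minimal sketch using the slide's symbols (x, n, G, N); the function name and the toy numbers are assumptions, and it is not the code of any particular GO tool.

```python
from math import comb

def hypergeom_enrichment_p(x, n, G, N):
    # P(X >= x): probability that x or more of the n genes in a cluster carry
    # an annotation, given that G of the N genes in the genome carry it
    # (sampling without replacement).
    upper = min(n, G)
    total = comb(N, n)
    hits = sum(comb(G, i) * comb(N - G, n - i) for i in range(x, upper + 1))
    return hits / total

# Hypothetical example: genome of 20 genes, 5 annotated; cluster of 4 genes,
# 2 of which are annotated.
p = hypergeom_enrichment_p(x=2, n=4, G=5, N=20)
print(p)
```

Because many GO terms are tested for every cluster, the resulting P-values would normally be corrected for multiple testing before being interpreted.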

50
Do these clusters effectively group functionally
related genes?
51

However, this is a bit suspicious.
If you look in their supplemental material, you
find: "metagenes that were further than d units
away from their closest center were excluded from
membership in any of the 12 components. We chose
d to be 10% of the diameter of the entire
landscape."
How many were excluded?

52
But are these predictions biologically valid?

Finding these metagenes in cancer data does
imply that they are connected to cell
proliferation, but it is weak proof that they are
actually cell proliferation genes

53
But are these predictions biologically valid?

RNAi involves feeding small double-stranded
RNA oligonucleotides into cells, which tricks the
cells into degrading their own mRNA.
New technique for quickly knocking out or
knocking down gene expression.

54
Worm gonads stained for DNA: wild-type worms show
fewer nuclei
55

Not a properly done RNAi experiment
How much does one experiment tell you? (Did they
have to do 10 RNAi experiments based on their
predictions before one worked?)

56
Does this network look like a biological network?

Count the number of links of each metagene (with
links being defined as the other metagenes it is
coexpressed with at a given P-value)
See if the distribution of links is non-random
and different from a network generated from
random data

57
Does this network look like a biological network?
58