Title: Functional Genomics and Gene Network Analysis
1Functional Genomics and Gene Network Analysis
- Alexandra Maertens
- Oct. 5, 2004
2Functional Genomics
- The fundamental strategy of functional genomics
is to expand the scope of biological
investigation from studying single genes or
proteins to studying all genes or proteins at
once in a systematic fashion.
- Functional genomics seeks to narrow the gap
between sequence and function and to yield new
insights into the behavior of biological
systems.
- paraphrased from http://bip.weizmann.ac.il/mb/functional_genomics.html
3How to determine functionally related genes?
- 40% of predicted genes in newly sequenced
genomes cannot be assigned a function based on
sequence similarity
- Other techniques include comparing
phylogenetic profiles, looking for gene fusion
events, etc., but all of these techniques are
inexact
4Does co-regulation imply functional similarity?
- Guilt by association: genes sharing a common
pattern of expression in many different
experiments are likely to be involved in similar
processes
- This technique was used by Tavazoie et al.
(1999) to identify biologically significant
DNA motifs in the promoter regions of genes
clustered based on cell-cycle expression patterns
in yeast.
- Limited applicability for determining functions
of genes in a single species
5Co-Expression Can Be Caused By
- Gene A regulates Gene B, or vice versa
- Or, they are both regulated by a third gene, C
- It can just represent a common response to the
environment of the cell
- And lastly, it can be an accident
6Determining Co-Regulation in Large Scale Data Sets
- Euclidean Distance
- Pearson
- Spearman
7Euclidean Distance
- Commonly used
- Not practical for analyzing a large data set with
a diverse set of microarrays
- Microarrays do not really measure an absolute
amount of mRNA; they should be considered as
measuring the relative amount of mRNA
- There will be large differences in the range of
values for two microarrays done at different
times
8- The Pearson correlation treats the vectors as if
they were the same (unit) length, and is thus
insensitive to the amplitude of changes that may
be seen in the expression profiles.
- Since Euclidean distance measures the absolute
distance between points in space, it takes into
account both the direction and the magnitude of
the vectors.
- From Stanford Microarray Database,
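The contrast between the two measures can be sketched in a few lines of Python. This is only an illustration: the profile values are made up, and the helper names `euclidean` and `pearson` are chosen here, not taken from the slides.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined if either profile is constant

# Hypothetical log-ratios: the second profile has the same shape as the
# first, but every change is reported at three times the amplitude.
gene_a = [0.1, 0.5, -0.3, 0.8, -0.6]
gene_b = [3 * v for v in gene_a]

print(euclidean(gene_a, gene_b))  # clearly non-zero: amplitudes differ
print(pearson(gene_a, gene_b))    # essentially 1: the shapes are identical
```

Two arrays that report the same pattern on different scales look far apart by Euclidean distance but perfectly correlated by Pearson.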
9Pearson
- How well does a linear function describe the
relationship between two variables?
- Values range from -1 (perfect negative
correlation) through 0 (no linear correlation)
to 1 (perfect positive correlation)
10Pearson (courtesy of Hyperstat.com)
11R = 0.778184
12Spearman
13Convert each expression value to a value
according to rank in each column (from lowest to
highest)
14- Calculate the difference between the ranks
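The two steps above (rank, then take rank differences) can be sketched as follows. The sketch assumes the usual rank-difference formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), which is exact only when there are no tied values. The function names and sample values are illustrative.

```python
def ranks(values):
    """Rank values from lowest (1) to highest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_no_ties(x, y):
    """Spearman's rho from squared rank differences (assumes no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical expression values: y rises whenever x rises, but not linearly.
x = [2.0, 8.5, 3.1, 40.0]
y = [0.1, 0.4, 0.2, 0.9]
print(ranks(x))                # [1.0, 3.0, 2.0, 4.0]
print(spearman_no_ties(x, y))  # 1.0
```

Because only the ordering matters, a monotonic but non-linear relationship still scores 1.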
15Going From Distance Measurements to Networks
- Why do biologists care about networks?
16From Patrik D'haeseleer, Harvard University,
http://genetics.med.harvard.edu/patrik
17Yeast Protein Interaction Network
Uetz, Schwikowski, Fields and co-workers; Ito and
co-workers
18Image credit: U.S. Department of Energy
Genomics:GTL Program, http://doegenomestolife.org
19A network is simply
- A collection of nodes (vertices)
- Connected by edges (links)
From Barabási, Nature Reviews Genetics, Vol. 5,
Feb. 2004
20From the layout of the network, you can calculate
a lot of interesting properties.
- Degree: the connectivity, k, which is the number
of links a node has to its neighbours
- Degree distribution: P(k) gives the probability
that a node will have a given number k of links
(obtained by counting the number of nodes
with each value of k and dividing by the total
number of nodes)
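As a rough sketch, both quantities can be computed directly from an edge list. The toy network and the function name `degree_distribution` are invented for illustration. Nodes that appear in no edge would need to be added separately with k = 0.

```python
from collections import Counter

def degree_distribution(edges):
    """Return ({node: k}, {k: P(k)}) for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())  # how many nodes have each k
    n = len(degree)                    # total number of nodes seen
    p_k = {k: c / n for k, c in sorted(counts.items())}
    return dict(degree), p_k

# Hypothetical toy network: one hub linked to four other nodes.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
degree, p_k = degree_distribution(edges)
print(degree["hub"])  # 4
print(p_k)            # {1: 0.4, 2: 0.4, 4: 0.2}
```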
21- From this, you can calculate whether the network
is scale-free
- This means that P(k) is proportional to k^(-γ)
- Usually γ is around 2
- This means that the network is organized into a
hub-and-spoke system, with a small number of
nodes having a large number of links
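One simple way to check for a power law is to fit a straight line to log P(k) against log k and read the exponent off the slope. The sketch below makes that assumption. A log-log least-squares fit is a crude estimator, and the function name and synthetic distribution are illustrative only.

```python
import math

def fit_power_law_exponent(p_k):
    """Least-squares slope of log P(k) versus log k; returns -slope."""
    xs = [math.log(k) for k in p_k]
    ys = [math.log(p) for p in p_k.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic, unnormalised distribution that follows k^(-2) exactly.
p_k = {k: k ** -2.0 for k in range(1, 51)}
print(fit_power_law_exponent(p_k))  # very close to 2
```

On real degree counts the tail is noisy, so a fitted exponent is a rough guide, not proof that a network is scale-free.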
22- If there are N nodes in a network, the number of
possible connections is on the order of N^2
- Assuming a modest 6,000 genes, that is still more
than we can hope to determine experimentally
- However, using correlation between two genes (or
orthologues) across several different experiments
as a stand-in for links, we can use microarrays
to develop a basic network that can then be
refined
- The more experiments, and the more types of
experiments, in which we find two genes
co-expressed, the more probable it is that they
are truly linked.
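A minimal sketch of the idea: count the candidate pairs, then keep as links only the pairs whose expression profiles correlate strongly. The threshold of 0.9, the gene names and the profile values are arbitrary choices for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(profiles, threshold):
    """Link every gene pair whose correlation is at least the threshold."""
    return [(g1, g2) for g1, g2 in combinations(sorted(profiles), 2)
            if pearson(profiles[g1], profiles[g2]) >= threshold]

# Distinct unordered pairs among 6,000 genes: N * (N - 1) / 2.
print(math.comb(6000, 2))  # 17997000

# Hypothetical profiles across four experiments.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(profiles, threshold=0.9)
print(edges)  # [('g1', 'g2')]
```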
23Hierarchical clustering vs. non-hierarchical
clustering
- Hierarchical clustering (various agglomerative
and divisive techniques) treats the data as if we
have no idea how many clusters there should be.
24K-means clustering
- The user decides in advance how many clusters
there will be
- This is a good method if you know a priori how
many clusters you want (for example, with three
time points you would use three clusters)
- Alternatively, if you visually inspect the data
you can see whether there appears to be a certain
number of clusters
- If all else fails, there are computer programs
that choose an ideal number of clusters from many
possible values of k
- The goal is to divide the objects into K clusters
such that some distance metric relative to the
centroids of the clusters is minimized
25- 1) Initial reference vectors are assigned randomly
or according to previous knowledge
- 2) Assign each object to one of k clusters randomly
- 3) Calculate average expression vectors for each
cluster (as reference vectors) and the distance
between clusters
- 4) Iteratively move objects between clusters; an
object stays in the new cluster when it is
closer to the new cluster than to the old
cluster.
- 5) Repeat steps 3-4 until convergence, i.e. until
moving any more objects would increase
intra-cluster distances
26For every cluster k, sum the squared distances
between every data point Xn in the cluster and the
centroid (mean) of the cluster.
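The procedure and its objective can be sketched in plain Python. This is a minimal version of the standard (Lloyd-style) k-means loop, not the code behind the slides. It assumes squared Euclidean distance and seeds the centroids with randomly chosen data points, and all names and sample points are illustrative.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_vector(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initial reference vectors
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:        # converged: no object moved
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # keep the old centroid if empty
                centroids[c] = mean_vector(members)
    # Objective: within-cluster sum of squared distances to the centroids.
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

# Hypothetical 2-D data with two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
labels, centroids, wcss = k_means(points, k=2)
print(labels)  # first three points share one label, last three the other
print(wcss)    # about 1.67
```

Different seeds can give different partitions on harder data, so in practice the algorithm is usually run several times and the result with the lowest objective is kept.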
28Finding the Center
so the new cluster centre for A is (19.666, 21)
This process is repeated until successive
iterations cause no overall change in the sum of
distances of each point in each cluster from the
center. (adapted from
http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/KMeans_eq_popup.htm)
29http://www-2.cs.cmu.edu/awm/tutorials/kmeans10.pdf
32- Remember:
- Clustering always works.
- There is no guarantee it is the optimal
partition.
33- Analyze coexpression relationships of homologous
sets of genes in human, fly, worm and yeast to
identify conserved genetic modules
- 3182 microarrays over multiple groups of
homologous genes (metagenes)
34- Metagenes
- Orthologous counterparts in different organisms
- Best reciprocal BLAST hit
- Each gene assigned to at most one metagene
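The best-reciprocal-hit rule can be sketched as a lookup in two tables of best BLAST hits, one per search direction. The gene identifiers and the function name are placeholders, and real input would come from parsed BLAST output.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    return sorted((a, b) for a, b in best_a_to_b.items()
                  if best_b_to_a.get(b) == a)

# Hypothetical best-hit tables (query gene -> top-scoring hit).
yeast_to_worm = {"yeastA": "wormX", "yeastB": "wormY", "yeastC": "wormX"}
worm_to_yeast = {"wormX": "yeastA", "wormY": "yeastB"}

pairs = reciprocal_best_hits(yeast_to_worm, worm_to_yeast)
print(pairs)  # [('yeastA', 'wormX'), ('yeastB', 'wormY')]
```

Here `yeastC` is left unpaired because its best hit, `wormX`, points back to a different yeast gene.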
35BLAST
- Stands for Basic Local Alignment Search Tool
- Performs a pairwise alignment of two strings
(either nucleotides or amino acids)
- Amino acid substitutions are scored according to
chemical similarity and observed
mutation frequencies
- The number of matches this good that would be
expected by chance is reported as the E-value
36A Metagene
37Global View of the Data Set
38- They then relabeled each gene in each species
with its metagene and compared the expression
levels of each pair of metagenes across the
arrays. This provided a Pearson correlation for
each pair.
39Are the data sufficient?
- Divided the data set randomly into halves
- Used each half independently to generate a
network and compared this to the network
generated by the total data to see how many
interactions were maintained at the same level of
statistical significance
- Only 40% of the interactions observed were
statistically significant in both halves
40- You can interpret this number two ways:
- 1) The approach is sensitive to the amount of
data used
- 2) However, even with only half of the data,
there was still some signal amidst the noise
41How robust is the constructed network to noise?
- They added increasing amounts of Gaussian noise
and found that the network was robust to the
realistic levels of noise seen in microarray
experiments
42- Permuting your data preserves the
distribution of your data, but without the labels
- They claim that they randomly permuted gene
expression data. This is a bit ambiguous
- Probably means they shuffled the expression data
within the species column, so the consistency of
the metagenes would cease to hold throughout the
row
- They repeated this shuffling 10 times. The point
of shuffling is to get a representative sample of
the ways you could rearrange your objects; 10 is
a conservative number for this instance.
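One reading of this shuffling step can be sketched as below. It is a guess at the procedure, in line with the "probably" above: within each species, the expression profiles are reassigned to metagenes at random, so values are preserved but cross-species correspondence is broken. The data layout and all names are assumptions.

```python
import random

def shuffle_gene_labels(species_profiles, seed=None):
    """Within each species, reassign profiles to metagenes at random."""
    rng = random.Random(seed)
    shuffled = {}
    for species, by_metagene in species_profiles.items():
        metagenes = list(by_metagene)
        profiles = [by_metagene[m] for m in metagenes]
        rng.shuffle(profiles)  # same profiles, new owners
        shuffled[species] = dict(zip(metagenes, profiles))
    return shuffled

# Hypothetical layout: species -> metagene -> expression profile.
species_profiles = {
    "worm": {"m1": [1, 2], "m2": [3, 4], "m3": [5, 6]},
    "fly":  {"m1": [7, 8], "m2": [9, 10], "m3": [11, 12]},
}

# Ten independent shuffles, mirroring the ten repeats described above.
null_sets = [shuffle_gene_labels(species_profiles, seed=i) for i in range(10)]
print(len(null_sets))  # 10
```

Running the same network construction on each shuffled set gives an estimate of how many links arise by chance alone.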
44Using Several Different Species
- By using more species, are you just providing
more data, or is there a benefit to looking at
conservation of coexpression across several
distantly related species?
45Using Several Different Species
46How to go from this data to a network?
- For each metagene, m1, they ranked every other
metagene in each species by how well it
correlates with m1, and then compared this
across species, producing a rank ratio in each
species for each directed pair
- They calculated a joint probability distribution
based on order statistics to determine if these
rankings were statistically significant.
47- They then used this p-value to do two things:
- If the p-value was below a cutoff (corrected for
multiple tests), this defined a link
- The strength of the p-value then provided a
distance metric that was used to visualize the
grouping of the data
- From visual inspection of this data, they decided
upon 12 clusters, and then performed k-means
clustering
49Do these clusters effectively group functionally
related genes?
- The Gene Ontology (or GO) terminology provides a
controlled vocabulary to describe protein and
gene function, cellular location and biological
process
- You can calculate whether you have an
overrepresentation of certain GO terms in any
given cluster to characterize the cluster
- The P-value is derived from a hypergeometric
distribution (sampling without replacement), as
the probability of x or more out of n genes
having a given annotation, given that G of N have
that annotation in the genome in general.
(Probably done by GOFinder)
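The hypergeometric tail probability described above can be computed with exact integer arithmetic. This sketch reuses the symbols x, n, G and N from the bullet. The function name and the example numbers are made up, and no correction for testing many GO terms is included.

```python
from math import comb

def hypergeom_enrichment_p(x, n, G, N):
    """P(X >= x): x or more annotated genes among n drawn without
    replacement from N genes, of which G carry the annotation."""
    upper = min(n, G)
    favourable = sum(comb(G, i) * comb(N - G, n - i)
                     for i in range(x, upper + 1))
    return favourable / comb(N, n)

# Hypothetical cluster: 3 genes drawn from a genome of 10 genes,
# 4 of which carry the annotation; at least 2 of the 3 are annotated.
p = hypergeom_enrichment_p(x=2, n=3, G=4, N=10)
print(p)  # 0.333..., i.e. (36 + 4) / 120
```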
50Do these clusters effectively group functionally
related genes?
51- However, this is a bit suspicious.
- If you look in their supplemental material, you
find: "metagenes that were further than d units
away from their closest center were excluded from
membership in any of the 12 components. We chose
d to be 10% of the diameter of the entire
landscape."
- How many were excluded?
52But are these predictions biologically valid?
- Finding these metagenes in cancer data does
imply that they are connected to cell
proliferation, but it is weak proof that they are
actually cell proliferation genes
53But are these predictions biologically valid?
- RNAi involves feeding small double-stranded
RNA oligonucleotides to cells, which tricks the
cells into degrading their own matching mRNA.
- New technique for quickly knocking out or
knocking down gene expression.
54Worm gonads stained for DNA: wild-type worms show
fewer nuclei
55- Not a properly done RNAi experiment
- How much does one experiment tell you? (Did they
have to do 10 RNAi experiments based on their
predictions before one worked?)
56Does this network look like a biological network?
- Count the number of links of each metagene (with
links being defined by the number of other
metagenes with which it is coexpressed at a
given P-value)
- See if the distribution of links is non-random
and different from a network generated from
random data
57Does this network look like a biological network?
58- Does this likely reflect the actual distribution
of links, or does this data by definition select
highly linked genes?
59What are the limitations?
- It has been estimated that in humans 95% of mRNA
transcripts are expressed at
<5 copies/cell
(Velculescu et al. (1997), Cell 88:243-251)