Functional Genomics and Gene Network Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Functional Genomics and Gene Network Analysis


1
Functional Genomics and Gene Network Analysis
  • Alexandra Maertens
  • Oct. 5, 2004

2
Functional Genomics
  • The fundamental strategy of functional genomics
    is to expand the scope of biological
    investigation from studying single genes or
    proteins to studying all genes or proteins at
    once in a systematic fashion.
  • Functional genomics seeks to narrow the gap
    between sequence and function and to yield new
    insights into the behavior of biological
    systems.
  • Paraphrased from
    http://bip.weizmann.ac.il/mb/functional_genomics.html

3
How to determine functionally related genes?
  • 40% of predicted genes in newly sequenced
    genomes cannot be assigned a function based on
    sequence similarity
  • Other techniques include comparing
    phylogenetic profiles, looking for gene fusion
    events, etc., but all of these techniques are
    inexact

4
Does co-regulation imply functional similarity?
  • Guilt by association: genes sharing a common
    pattern of expression in many different
    experiments are likely to be involved in similar
    processes
  • This technique has been used by Tavazoie et al.
    (1999) to identify biologically significant
    DNA-motifs in the promoter region of genes
    clustered based on cell-cycle expression patterns
    in yeast.
  • Limited applicability for determining functions
    of genes in a single species

5
Co-Expression Can Be Caused By
  • Gene A regulates Gene B, or vice versa
  • Or, they are both regulated by a third gene, C
  • It can just represent a common response to the
    environment of the cell
  • And lastly it can be an accident

6
Determining Co-Regulation in Large Scale Data Sets
  • Euclidean Distance
  • Pearson
  • Spearman

7
Euclidean Distance
  • Commonly used
  • Not practical for analyzing a large data set with
    a diverse set of microarrays
  • Microarrays do not really measure an absolute
    amount of mRNA; they should be considered as
    measuring the relative amount of mRNA
  • There will be large differences in range of
    values for two microarrays done at different
    times

8
  • The Pearson Correlation treats the vectors as if
    they were the same (unit) length, and is thus
    insensitive to the amplitude of changes that may
    be seen in the expression profiles.
  • Since Euclidean distance measures the absolute
    distance between points in space, the Euclidean
    distance thus takes into account both the
    direction and the magnitude of the vectors.
  • From Stanford Microarray Database
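
The contrast above can be checked on a toy case. The sketch below is not from the original slides: the two profiles are made-up values, chosen so that one gene has exactly the same shape as the other at ten times the amplitude.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical profiles: gene_b has the same shape as gene_a at 10x the amplitude.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0 * v for v in gene_a]

r = pearson(gene_a, gene_b)    # shape only: perfectly correlated
d = euclidean(gene_a, gene_b)  # shape and magnitude: far apart
print(f"Pearson r = {r:.3f}, Euclidean distance = {d:.3f}")
```

The correlation treats the two genes as identical in behaviour, while the Euclidean distance is dominated by the difference in scale, which is the property that makes it awkward for arrays run at different times.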

9
Pearson's Correlation
  • How well does a linear function describe the
    relationship between two variables?
  • Values range from -1 (perfect negative
    correlation) through 0 (no correlation) to 1
    (perfect positive correlation)

10
Pearson (courtesy of Hyperstat.com)
11
R = 0.778184
12
Spearman
  • from www.mathworld.com

13
Convert each expression value to a value
according to rank in each column (from lowest to
highest)
14
  • Calculate the difference between the ranks
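
The two steps above (rank each column, then take rank differences) can be sketched as follows. This is an illustrative sketch rather than the presenter's code: the expression values are made up, and the closing formula, 1 - 6 Σd² / (n(n² - 1)), is the standard Spearman shortcut, which is exact only when there are no tied values.

```python
def ranks(values):
    """Rank values from lowest (1) to highest; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_no_ties(x, y):
    """Spearman rho from rank differences; exact only when no values are tied."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical expression values for two genes across five arrays.
gene_a = [0.1, 2.5, 0.7, 9.0, 1.2]
gene_b = [1.0, 3.0, 2.0, 50.0, 2.5]
rho = spearman_no_ties(gene_a, gene_b)
print(ranks(gene_a), ranks(gene_b), rho)
```

Because only the ordering matters, the large outlier in the second gene does not affect the result.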

15
Going From Distance Measurements to Networks
  • Why do biologists care about networks?

16
From Patrik D'haeseleer, Harvard University,
http://genetics.med.harvard.edu/patrik

17
Yeast Protein Interaction Network
Uetz, Schwikowski, Fields and co-workers; Ito and
co-workers
18
Image credit: U.S. Department of Energy
Genomics:GTL Program, http://doegenomestolife.org
19
A network is simply
  • A collection of nodes (vertices)
  • Connected by edges (links)

From Barabási, Nature Reviews Genetics, Vol. 5,
Feb. 2004
20
From the layout of the network, you can calculate
a lot of interesting properties.
  • Degree: the connectivity, k, which is the number
    of links a node has to its neighbours
  • Degree distribution: P(k) gives the probability
    that a node will have a given number k of links
    (obtained by counting the number of nodes
    with each value of k and dividing by the total
    number of nodes)
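
As a minimal sketch of these two definitions, assuming the network is given as an undirected edge list (the toy gene names below are hypothetical):

```python
from collections import Counter

def degrees(edges):
    """Number of links per node in an undirected edge list."""
    k = Counter()
    for a, b in edges:
        k[a] += 1
        k[b] += 1
    return dict(k)

def degree_distribution(edges):
    """P(k): fraction of nodes that have exactly k links."""
    k = degrees(edges)
    counts = Counter(k.values())
    n = len(k)
    return {deg: c / n for deg, c in sorted(counts.items())}

# Hypothetical hub-and-spoke toy network: "hub" links to four other genes.
edges = [("hub", "g1"), ("hub", "g2"), ("hub", "g3"), ("hub", "g4"), ("g1", "g2")]
p_k = degree_distribution(edges)
print(degrees(edges))
print(p_k)
```

Nodes with no links never appear in an edge list, so this sketch ignores k = 0; a full node list would be needed to count them.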

21
  • From this, you can calculate whether the network
    is scale-free
  • This means that P(k) is proportional to k^(-γ)
  • Usually γ is around 2
  • This means that the network is organized into a
    hub-and-spoke system, with a small number of
    nodes having a large number of links
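
One rough way to estimate the exponent is to fit a straight line to log P(k) against log k. The sketch below uses a synthetic distribution that follows a power law exactly. A least-squares fit on a log-log plot is a quick check rather than a rigorous test of scale-freeness, and it is not necessarily what any cited study did.

```python
import math

def fit_power_law_exponent(p_k):
    """Least-squares slope of log P(k) against log k; returns the negated slope."""
    pts = [(math.log(k), math.log(p)) for k, p in p_k.items() if k > 0 and p > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    return -sxy / sxx

# Synthetic distribution proportional to k^-2, normalised over k = 1..50.
raw = {k: k ** -2.0 for k in range(1, 51)}
total = sum(raw.values())
p_k = {k: v / total for k, v in raw.items()}

exponent = fit_power_law_exponent(p_k)
print(f"estimated exponent = {exponent:.3f}")
```

On real data the tail of P(k) is noisy, so the fitted value depends heavily on which range of k is included.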

22
  • If there are N nodes in a network, the number of
    possible connections is on the order of N^2
  • Assuming a modest 6,000 genes, that is still more
    than we can hope to experimentally determine
  • However, using correlation between two genes (or
    orthologues) across several different experiments
    as a stand-in for links, we can use microarrays
    to develop a basic network that can then be
    refined
  • The more experiments, and the more types of
    experiments, in which we find two genes
    co-expressed, the more probable it is that they
    are truly linked.
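
A minimal sketch of using correlation as a stand-in for links, assuming a simple fixed Pearson cutoff. The cutoff of 0.9 and the three toy profiles are hypothetical and are chosen only to illustrate the idea.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(profiles, cutoff):
    """Link every gene pair whose Pearson correlation is at least `cutoff`."""
    return [(a, b) for a, b in combinations(sorted(profiles), 2)
            if pearson(profiles[a], profiles[b]) >= cutoff]

# Number of unordered gene pairs for 6,000 genes.
n_genes = 6000
n_pairs = n_genes * (n_genes - 1) // 2

# Hypothetical expression profiles across four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(profiles, cutoff=0.9)
print(n_pairs, edges)
```

Even counting each pair once, 6,000 genes give roughly 18 million candidate links, which is why a computed proxy is attractive as a first pass.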

23
Hierarchical clustering vs. non-hierarchical
clustering
  • Hierarchical clustering (various agglomerative
    and divisive techniques) treats the data as if we
    have no idea how many clusters there should be.

24
K-means clustering
  • the user decides in advance how many clusters
    there will be
  • this is a good method if you know a priori how
    many clusters you want (for example, with three
    time points you would use three clusters)
  • alternatively, if you visually inspect the data
    you can see if there appears to be a certain
    number of clusters
  • If all else fails, there are computer programs
    that estimate an ideal number of clusters from
    many possible values for k
  • goal is to divide the objects into K clusters
    such that some distance metric relative to the
    centroids of the clusters is minimized

25
  • Initial reference vectors are assigned randomly
    or according to previous knowledge
  • Assign each object to one of k clusters randomly
  • Calculate the average expression vector for each
    cluster (as its reference vector) and the distance
    between clusters
  • Iteratively move objects between clusters; an
    object stays in the new cluster when it is
    closer to the new cluster than to the old
    cluster.
  • Repeat the two preceding steps until convergence,
    i.e. until moving any more objects would increase
    intra-cluster distances
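
The procedure above can be sketched as follows. This is a simplified batch variant: it reassigns all objects at once instead of moving them one at a time. The toy points, the seed, and the handling of empty clusters are assumptions made for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centers) once no object changes cluster."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]  # random initial assignment
    centers = []
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centers.append(centroid(members) if members else tuple(rng.choice(points)))
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # converged: nothing moved
            break
        labels = new_labels
    return labels, centers

# Hypothetical 2-D expression vectors forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centers = k_means(points, k=2)
print(labels, centers)
```

On data this cleanly separated the result does not depend on the seed. On real expression data different random starts can converge to different partitions, so several restarts are usually compared.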

26
For every cluster k, sum the distances between
every data point Xn in the cluster and the
centroid (arithmetic mean) of the cluster.
27
(No Transcript)
28
Finding the Center
                                             
So the new cluster centre for A is (19.666, 21).
This process is repeated until successive
iterations cause no overall change in the sum of
distances of each point in each cluster from the
center. (Adapted from
http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/KMeans_eq_popup.htm)
29
http://www-2.cs.cmu.edu/awm/tutorials/kmeans10.pdf
30
(No Transcript)
31
(No Transcript)
32
  • Remember
  • Clustering always works.
  • There is no guarantee it is the optimal
    partition.

33
  • Analyze coexpression relationships of homologous
    sets of genes in human, fly, worm, yeast to
    identify conserved genetic modules
  • 3182 microarrays over multiple groups of
    homologous genes (metagenes)

34
  • Metagenes
  • Orthologous counterparts in different organisms
  • Best reciprocal BLAST hit
  • Each gene assigned to at least one metagene
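
A minimal sketch of the "best reciprocal hit" criterion for a pair of species. It assumes the BLAST results have already been reduced to a score per query/hit pair. The gene names and scores are invented, and the dictionary layout is a hypothetical convenience, not a BLAST output format.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other species."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical alignment scores (higher is better); gene names are made up.
yeast_vs_worm = {
    "yA": {"wA": 250.0, "wB": 90.0},
    "yB": {"wA": 120.0, "wB": 110.0},
}
worm_vs_yeast = {
    "wA": {"yA": 245.0, "yB": 118.0},
    "wB": {"yB": 105.0, "yA": 88.0},
}
pairs = reciprocal_best_hits(yeast_vs_worm, worm_vs_yeast)
print(pairs)
```

Here yB's best hit is wA, but wA's best hit is yA, so only the yA/wA pair qualifies.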

35
BLAST
  • stands for Basic Local Alignment Search Tool
  • Performs a pairwise alignment of two strings
    (either nucleotides or amino acids)
  • Amino acid substitutions are scored according to
    chemical similarity and observed
    mutation frequencies
  • The number of matches this good that would be
    expected by chance is given as an E value

36
A Metagene
37
Global View of the Data Set
38
  • They then identified pairs of genes by
    relabeling each gene in each species with its
    metagene, and then comparing expression levels
    across each different array. This provided a
    Pearson correlation.

39
Are the data sufficient?
  • Divided the data set randomly into halves
  • Used each half independently to generate a
    network and compared this to the network
    generated by the total data to see how many
    interactions were maintained at the same level of
    statistical significance
  • Only 40% of the interactions observed were
    statistically significant in both halves

40
  • You can interpret this number two ways
  • 1) The approach is sensitive to the amount of
    data used
  • 2) However, even with only half of the data,
    there was still some signal amidst the noise

41
How robust is the constructed network to noise?
  • They added increasing amounts of Gaussian noise
    and found that the network was robust with
    realistic levels of noise seen in microarray
    experiments

42
  • Permuting your data preserves the same
    distribution of your data, but without the labels
  • They claim that they randomly permuted gene
    expression data. This is a bit ambiguous
  • Probably means they shuffled the expression data
    within the species column, so the consistency of
    the metagenes would cease to hold throughout the
    row
  • They repeated this shuffling 10 times. The point
    of shuffling is to get a representative sample of
    the ways you could rearrange your objects; 10 is
    a conservative number for this instance.
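
The reading guessed at above (shuffle within each species column so that rows no longer correspond to one metagene) can be sketched as below. This illustrates that guess only, not the authors' documented procedure, and the small table is made up.

```python
import random

def shuffle_within_species(matrix, seed=0):
    """Shuffle each column independently, breaking the row (metagene) labels."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*matrix)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

# Hypothetical table: rows are metagenes, columns are species.
table = [
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400],
]

# Ten independent permutations, matching the count mentioned above.
shuffled = [shuffle_within_species(table, seed=s) for s in range(10)]
print(shuffled[0])
```

Each column keeps exactly the same set of values, so per-species distributions are preserved while the cross-species pairing is destroyed.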

43
(No Transcript)
44
Using Several Different Species
  • By using more species, are you just providing
    more data, or is there a benefit to looking at
    conservation of coexpression across several
    distantly related species?

45
Using Several Different Species
46
How to go from this data to a network?
  • For each metagene, m1, they ranked all other
    metagenes in each species by how
    well they correlate with m1, and then compared
    this across species, producing a rank ratio in
    each species for each directed pair
  • They calculated a joint probability distribution
    based on order statistics to determine if these
    rankings were statistically significant.

47
  • They then used this p-value to do two things
  • if the p-value was below a cutoff (corrected for
    multiple tests) this defined a link
  • The strength of the p-value then provided a
    distance metric that was then used to visualize
    the grouping of the data
  • From visual inspection of this data, they decided
    upon 12 clusters, and then performed k-means
    clustering

48
(No Transcript)
49
Do these clusters effectively group functionally
related genes?
  • The Gene Ontology (or GO) terminology provides a
    controlled vocabulary to describe protein and
    gene function, cellular location and biological
    process
  • You can calculate whether you have an
    overrepresentation of certain GO terms in any
    given cluster to characterize the cluster
  • The P-value is derived from a hypergeometric
    distribution (sampling without replacement), as
    the probability of x or more out of n genes
    having a given annotation, given that G of N have
    that annotation in the genome in general.
    (Probably done by GOFinder)
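
The hypergeometric tail described above can be computed directly. This is a minimal sketch using the symbols x, n, G and N from the slide. It applies no multiple-testing correction, and the example numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(x, n, G, N):
    """P(X >= x) when n genes are drawn without replacement from N, of which G are annotated."""
    upper = min(n, G)
    favourable = sum(comb(G, i) * comb(N - G, n - i) for i in range(x, upper + 1))
    return favourable / comb(N, n)

# Hypothetical cluster: 5 genes, 3 of which carry a term annotated on 10 of 100 genes.
p = enrichment_p_value(x=3, n=5, G=10, N=100)
print(f"p = {p:.5f}")
```

In practice one such test is run per GO term per cluster, so the resulting P-values need correcting for the number of terms tested.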

50
Do these clusters effectively group functionally
related genes?
51
  • However, this is a bit suspicious.
  • If you look in their supplemental material, you
    find: "metagenes that were further than d units
    away from their closest center were excluded from
    membership in any of the 12 components. We chose
    d to be 10% of the diameter of the entire
    landscape."
  • How many were excluded?

52
But are these predictions biologically valid?
  • Finding these metagenes in cancer data does
    imply that they are connected to cell
    proliferation, but it is weak proof that they are
    actually cell proliferation genes

53
But are these predictions biologically valid?
  • RNAi involves feeding small double-stranded
    oligonucleotides into cells, which tricks the
    cells into degrading their own mRNA.
  • New technique for quickly knocking out or
    knocking down gene expression.

54
Worm gonads stained for DNA: wild-type worms show
fewer nuclei
55
  • Not a properly done RNAi experiment
  • How much does one experiment tell you? (Did they
    have to do 10 RNAi experiments based on their
    predictions before one worked?)

56
Does this network look like a biological network?
  • Count the number of links of each metagene (with
    links being defined as the other metagenes it is
    coexpressed with at a given P-value)
  • See if the distribution of links is non-random
    and different from a network generated from
    random data

57
Does this network look like a biological network?
58
  • Does this likely reflect the actual distribution
    of links, or does this data by definition select
    highly linked genes?

59
What are the limitations?
  • It has been estimated that in humans 95% of mRNA
    transcripts are expressed at
    <5 copies/cell
    Velculescu et al. (1997), Cell 88:243-251