Beyond Co-expression: Gene Network Inference - PowerPoint PPT Presentation

1 / 56
About This Presentation
Title:

Beyond Co-expression: Gene Network Inference

Description:

Beyond Co-expression: Gene Network Inference Patrik D haeseleer Harvard University http:/genetics.med.harvard.edu/~patrik – PowerPoint PPT presentation

Number of Views:217
Avg rating:3.0/5.0
Slides: 57
Provided by: pdha
Category:

less

Transcript and Presenter's Notes

Title: Beyond Co-expression: Gene Network Inference


1
Beyond Co-expressionGene Network Inference
  • Patrik Dhaeseleer
  • Harvard University
  • http/genetics.med.harvard.edu/patrik

2
Beyond Co-expression
  • Clustering approaches rely on co-expression of
    genes under different conditions
  • Assumes co-expression is caused by co-regulation
  • We would like to do better than that
  • Causal inference
  • What is regulating what?

3
Gene Network Inference
4
Overview
  • Modeling Issues
  • Level of biochemical detail
  • Boolean or continuous?
  • Deterministic or stochastic?
  • Spatial or non-spatial?
  • Data Requirements
  • Linear Models
  • Nonlinear models
  • Conclusions

5
Level of Biochemical Detail
  • Detailed models require lots of data!
  • Highly detailed biochemical models are only
    feasible for very small systems which are
    extensively studied
  • Example Arkin et al. (1998), Genetics
    149(4)1633-48
  • lysis-lysogeny switch in Lambda
  • 5 genes, 67 parameters based on 50 years of
    research, stochastic simulation required
    supercomputer

6
Example Lysis-Lysogeny
Arkin et al. (1998), Genetics 149(4)1633-48
7
Level of Biochemical Detail
  • In-depth biochemical simulation of e.g. a whole
    cell is infeasible (so far)
  • Less detailed network models are useful when data
    is scarce and/or network structure is unknown
  • Once network structure has been determined, we
    can refine the model

8
Boolean or Continuous?
  • Boolean Networks (Kauffman (1993), The Origins of
    Order) assumes ON/OFF gene states.
  • Allows analysis at the network-level
  • Provides useful insights in network dynamics
  • Algorithms for network inference from binary data

9
Boolean or Continuous?
  • Boolean abstraction is poor fit to real data
  • Cannot model important concepts
  • amplification of a signal
  • subtraction and addition of signals
  • compensating for smoothly varying environmental
    parameter (e.g. temperature, nutrients)
  • varying dynamical behavior (e.g. cell cycle
    period)
  • Feedback control
  • negative feedback is used to stabilize expression
  • ?? causes oscillation in Boolean model

10
Deterministic or Stochastic?
  • Use of concentrations assumes individual
    molecules can be ignored
  • Known examples (in prokaryotes) where stochastic
    fluctuations play an essential role (e.g.
    lysis-lysogeny in lambda)
  • Requires stochastic simulation (Arkin et al.
    (1998), Genetics 149(4)1633-48), or modeling
    molecule counts (e.g. Petri nets, Goss and
    Peccoud (1998), PNAS 95(12)6750-5)
  • Significantly increases model complexity

11
Deterministic or Stochastic?
  • Eukaryotes larger cell volume, typically longer
    half-lives. Few known stochastic effects.
  • Yeast 80 of the transcriptome
    is expressed at 0.1-2 mRNA
    copies/cell
    Holstege, et al.(1998),
    Cell 95717-728.
  • Human 95 of transcriptome is
    expressed at lt5 copies/cell
    Velculescu et al.(1997), Cell 88243-251

12
Spatial or Non-Spatial
  • Spatiality introduces additional complexity
  • intercellular interactions
  • spatial differentiation
  • cell compartments
  • cell types
  • Spatial patterns also provide more data
  • e.g. stripe formation in Drosophila
  • Mjolsness et al. (1991), J. Theor. Biol. 152
    429-454.
  • Few (no?) large-scale spatial gene expression
    data sets available so far.

13
Overview
  • Modeling Issues
  • Level of biochemical detail
  • Boolean or continuous?
  • Deterministic or stochastic?
  • Spatial or non-spatial?
  • Data Requirements
  • Linear Models
  • Nonlinear models
  • Conclusions

14
Overview
  • Modeling Issues
  • Data Requirements
  • Lower bounds from information theory
  • Effect of limited connectivity
  • Comparison with clustering
  • Variety of data points needed
  • Linear Models
  • Nonlinear models
  • Conclusions

15
Lower Bounds from Information Theory
  • How many bits of information are needed just to
    specify the connection pattern of a network?
  • N2 possible connections between N nodes
  • ? N2 bits needed to specify which connections
    are present or absent
  • O(N) bits of information per data point
  • ? O(N) data points needed

16
Effect of Limited Connectivity
  • Assume only K inputs per gene (on average)
  • ? NK connections out of N2 possible
  • possible connection patterns
  • Number of bits needed to fully specify the
    connection pattern
  • ? O(Klog(N/K)) data points needed

17
Comparison with clustering
  • Use pairwise correlation comparisons as a
    stand-in for clustering
  • As number of genes increases, number of false
    positives will increase as well ? need to use
    more stringent correlation test
  • If we want to use the same correlation cutoff
    value r, we need to increase the number of data
    points as N increases
  • ? O(log(N)) data points needed

18
Summary
  • Fully connected N (thousands)
  • Connectivity K Klog(N/K) (hundreds?)
  • Clustering log(N) (tens)
  • Additional constraints reduce data requirements
  • limited connectivity
  • choice of regulatory functions
  • Network inference is feasible, but does require
    much more data than clustering

19
Variety of Data Points Needed
  • To unravel regulation of a gene, need to sample
    many different combinations of its regulatory
    inputs (different environmental conditions and
    perturbations)
  • Time series data yields dynamics, but all data
    points are related
  • Steady-state data yields attractors, can sample
    state space more efficiently
  • Both types of data will be needed, and multiple
    data sets of each

20
Overview
  • Modeling Issues
  • Data Requirements
  • Linear Models
  • Formulation
  • Underdetermined problem!
  • Solution 1 reduce N
  • Solution 2
  • Nonlinear models
  • Conclusions

21
Linear Models
  • Basic model weighted sum of inputs
  • Simple network representation
  • Only first-order approximation
  • Parameters of the model
    weight matrix containing NxN interaction
    weights
  • Fitting the model find the parameters wji, bi
    such that model best fits available data

22
Underdetermined problem!
  • Assumes fully connected network need at least as
    many data points (arrays, conditions) as
    variables (genes)!
  • Underdetermined (underconstrained, ill-posed)
    model we have many more parameters than data
    values to fit
  • No single solution, rather infinite number of
    parameter settings that will all fit the data
    equally well

23
Solution 1 reduce N
  • Rather than trying to model all genes, we can
    reduce the dimensionality of the problem
  • Network of clusters construct a linear model
    based on the cluster centroids
  • rat CNS data (4 clusters) Wahde and Hertz
    (2000), Biosystems 55, 1-3129-136.
  • yeast cell cycle (15-18 clusters) Mjolsness et
    al.(2000), Advances in Neural Information
    Processing Systems 12 van Someren et al.(2000)
    ISMB2000, 355-366.
  • Network of Principle Components linear model
    between characteristic modes of the data
  • Holter et al.(2001), PNAS 98(4)1693-1698.

24
Solution 2
  • Take advantage of additional information
  • replicates
  • accuracy of measurements
  • smoothness of time series
  • Most likely, the network will still be poorly
    constrained.
  • ? Need a method to identify and extract those
    parts of the model that are well-determined and
    robust

25
Whats next?
  • Regulatory motifs
  • once we have identified the corresponding DNA
    binding
  • proteins (transcription factors), we can start
    building the
  • gene network from there
  • Integration with other data
  • transcription factors
  • functional annotation
  • known interactions in the literature
  • protein-protein interactions
  • protein expression levels
  • genetic data
  • ...

26
Linking Regulatory Motifs to Expression Data
  • Patrik Dhaeseleer
  • Harvard University
  • http/genetics.med.harvard.edu/patrik

27
Introduction
  • Gene expression is regulated by Transcription
    Factors (TFs), that bind to specific regulatory
    motifs in the promoter region of the gene.
  • Synonyms regulatory element, regulatory
    sequence, promoter elements, promoter motifs,
    (TF) binding site, operator (in prokaryotes),
  • Question Do genes with similar expression
    patterns share regulatory motifs?

28
1 Systematic Determination of Genetic Network
Architecture
Time-point 1
Tavazoie et al., Nature Genetics 22, 281 285
(1999)
Normalized Expression
Time-point 3
Time -point
Time-point 2
Normalized Expression
Normalized Expression
Time -point
Time -point
29
Search for Motifs in Promoter Regions
5- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTC
ATGAGAAAAGAGTCAGACATCGAAACATACAT
HIS7
ARO4
5- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTG
CATTTTGTACGTTACTGCGAAATGACTCAACG
ILV6
5- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTC
GCATCGCCGAAGTGCCATAAAAAATATTTTTT
5- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAA
AATTTTCGACAAAATGTATAGTCATTTCTATC
THR4
ARO1
5- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAAT
TGTCATGCATATGACTCATCCCGAACATGAAA
5- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTA
GAGAAAAATAGAAAAGCAGAAAAAATAAATAA
HOM2
5- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTC
TTTTTTGGAAAGTGTGGCATGTGCTTCACACA
PRO3
300-600 bp of upstream sequence per gene are
searched in Saccharomyces cerevisiae.
30
Best Motif Found by AlignACE
31
N182
32
Systematic Determination of Genetic Network
Architecture
  • Tavazoie et al., Nature Genetics 22, 281285
    (1999)
  • Most motifs found are highly selective for the
    cluster they were found in.
  • Can find many known binding sites for
    transcription factors.
  • Also finds many novel regulatory motifs,
    associated with specific functional categories.
  • 1) cluster
  • 2) identify regulatory motifs in clustered genes
  • 3) identify TFs that bind to those motifs
  • ? Gene regulation network

33
2 Regulatory Element Detection Using Correlation
with Expression
Bussemaker et al., Nature Genetics 27, 167 174
(2001)
  • What is the contribution of each regulatory motif
    (or the TF that binds to that motif) to the
    expression level of the genes containing the
    motif?
  • Given a set of known or putative regulatory
    motifs, identify all genes that contain the motif
    in their promoter region.
  • For a single expression experiment (e.g. single
    point in a time series), is the presence of the
    motif correlated with the expression level of the
    genes?
  • Perform multiple regression of (log) expression
    level on the presence/absence of the motifs.
  • Plot contribution of motif throughout time series.

34
Contribution of Motifs to Expression Levels
35
Linear Combination of Motif Contributions
  • Find the most highly correlated motif.
  • Determine its contribution Fi to expression level
    by linear regression.
  • Subtract its contribution from the expression
    levels.
  • Find the next highest correlated motif.
  • Repeat until no more significantly correlated
    motifs.
  • Repeat this entire analysis for each time point
    of a time series ? weights Fi for the individual
    motifs will change throughout he time course.

36
Time Courses of Regulatory Signals
  • We can think of the time-varying contributions Fi
    of each motif as the Regulatory Signals of the
    transcription factors that bind to these motifs

37
Regulatory Element Detection Using Correlation
with Expression
  • Bussemaker et al., Nature Genetics 27, 167174
    (2001)
  • Can be used with known regulatory motifs, sets of
    putative motifs, and even exhaustively on the set
    of all motifs up to a certain length (n7).
  • Known motifs generally have high statistical
    significance.
  • Allows us to infer regulatory inputs of (possibly
    unknown) transcription factors.
  • Accounts for only 30 of total signal present in
    genome-wide expression patterns.
  • Purely linear model no synergistic effects
    between TFs, cooperative binding, etc.

38
3 Identifying Regulatory Networks by
Combinatorial Analysis of Promoter Elements
Pilpel et al., Nature Genetics 29, 153159 (2001)
  • Most transcription factors are thought to work in
    concert with other TFs.
  • ? Synergistic effects
  • Clustering
  • a motif may occur in more than one cluster,
    because it may give rise to different expression
    patterns depending on its interaction partners.
  • several motifs may occur in the same cluster.
  • Correlation with expression pattern
  • by itself, a motif may not show a clear
    expression pattern.
  • contributions of multiple motifs may not be
    simply additive.

39
Synergy between Mcm1 and SFF in Cell Cycle Data
Set
Mcm1 and SFF were not detected in Tavazoie et
al Yet TFs that bind these motifs are known to
interact in control of G2-genes (Nature. 2000
40690-4.) Bussemaker et al found that these
motifs are antagonistic.
40
Expression Coherence and Synergy
  • Expression Coherence (EC) score indicates how
    tightly clustered the expression profiles of a
    set of genes are.
  • For every combination of N2,3 motifs

1) Calculate the expression coherence score of
the genes that have the N motifs 2) Calculate the
expression coherence score of genes that have
every possible subset of N-1 motifs 3 )Test
(statistically) the hypothesis that the score of
the orfs with N motifs is significantly higher
than that of orfs that have any sub set of N-1
motifs
41
The Combinogram
Highly synergistic interaction between MCB and
SFF Previously unknown Subsequently predicted via
chromatin immuno-precipitation (ChIP)
(cell cycle data)
42
Identifying Regulatory Networks by Combinatorial
Analysis of Promoter Elements
  • Pilpel et al., Nature Genetics 29, 153159 (2001)
  • Found several known and novel interactions
    between regulatory elements active in cell cycle,
    sporulation and stress response.
  • Doesnt assume a specific (e.g. linear) model of
    TF interactions.
  • Combined with TF expression patterns, may allow
    us to infer a model of interaction.

43
Protein Networks
  • Patrik Dhaeseleer
  • Harvard University
  • http/genetics.med.harvard.edu/patrik

44
Yeast 2-Hybrid Assays
Transcription Factor (e.g. Gal4)
MATa
MATa

Reconstituted active TF
bait fusion
prey fusion
Fields and Song, Nature 340245-246 (1989)
45
Large-Scale 2-Hybrid Data Sets
  • Uetz et al, Nature 403623-627 (2000)
  • 6000 x 192 protein pairs screened using protein
    array
  • nearly all 6000 x 6000 pairs, using pooled prey
    libraries
  • total of 957 putative interactions between 1004
    proteins
  • Ito et al, PNAS 984569-4574 (2001)
  • nearly all 6000 x 6000 pairs, using bait and prey
    pools
  • total of 4549 putative interactions between 3278
    proteins
  • core set of 841 interactions between 797 proteins
  • Surprisingly little overlap between the data
    sets, possibly indicating a large number of
    missed interactions (false negative).

46
Intersections between Protein Interaction Data
Sets
MIPS 1546
MIPS 1546
Ito full 4475
1415
1436
Ito core 806
49
28
54
28
21
61
648
109
156
709
4242
756
Uetz 947
Uetz 947
47
Causes of False Positives
  • Bait acts as activator
  • Bait interacts with endogenous activator
  • Prey binds to DNA
  • Prey interacts with endogenous transcription
    factors
  • Bait interacts with Activation Domain
  • Prey interacts with DNA Binding Domain
  • Sticky proteins (nonspecific binding)
  • Changes in plasmid copy number
  • Various other artifacts
  • . . .

48
Yeast Protein-Protein Interaction Map
Each node is a protein Each line is an
interaction 5560 putative interactions 3725
different proteins 3 interactions / protein
Uetz, Schwikowski, Fields and co-workers Ito and
co-workers
49
Membrane Proteins
Transcription Factors
50
- membrane protein - DNA-binding protein - all
other yeast proteins - physical interaction
between two proteins
51
Problem How to Rank Possible Pathways?
Ste2/3
Possible Paths from Ste3 1045 different paths
to 143 transcription factors
Ste12
52
Rank Predicted Paths by Degree of Expression
Correlation from Microarray Expreriments
  • Known pathways often show correlated expression
  • Known interacting proteins often show correlated
    expression

Average Pairwise Correlation Coefficient Among
Pathway Members
STE3?AKR1?STE5?STE4?FAR1?CDC24?SOH1
0.190 STE3?AKR1?IQG1?CDC42?BEM4?RHO1?SKN7
0.059 STE3?AKR1?STE4?FAR1?FUS3?DIG2?STE12
0.281 STE3?AKR1?GCS1?YGL198W?SAS10?NET1 -0.106
Microarray Data Downloaded from Rosetta
Inpharmatics
53
Classical View of MAPK Pathways
adapted from C.Roberts, et al., Science, 287, 873
(2000)
54
The Protein Network View
  • Highly interconnected, not just a linear pathway!
  • Some proteins are missing from the protein
    interaction data sets (Cdc42, Ste20).
  • Includes several additional proteins (especially
    Akr1, Kss1).

55
Conclusions
  • Protein interaction data and expression data are
    both noisy. Combining them increases the
    accuracy.
  • Can estimate protein interaction error rates by
    looking at consistency between data sets ?
    probabilistic interaction model (work in
    progress).
  • Pathways are far more interconnected than often
    portrayed.
  • Can integrate various other forms of data
  • co-localization of proteins
  • homology with known interacting proteins
  • Rosetta Stone method

56
Acknowledgments
Roland Somogyi Stefanie FuhrmanXiling Wen
NCGR Jason Stewart Pedro Mendes
Harvard Tzachi PilpelMartin SteffenAllegra
PettiJohn AachGeorge Church
UNM Stephanie ForrestAndreas WagnerDavid
PeabodyBarak Pearlmutter
The Santa Fe Institute
Write a Comment
User Comments (0)
About PowerShow.com