Title: Michael A. Beer and Saeed Tavazoie
1- Michael A. Beer and Saeed Tavazoie
- Cell 117, 185-198 (16 April 2004)
2The Authors
- Saeed Tavazoie (middle): Professor, Dept. of
Molecular Biology
- Mike Beer: Postdoctoral Researcher; Ph.D., Princeton
(1995)
3The Question
- Transcription factor binding sites are
relatively well-characterized in Saccharomyces
cerevisiae
- But the presence of a TF binding site alone
is not sufficient to predict expression of a gene
- Multiple regulatory factors are often involved
- How do you identify the elaborate rules for gene
regulation?
4Simple regulatory structures
Each possible combination of TFs must be tested
in the lab. This is a hugely time-consuming task.
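To give a feel for why exhaustive testing is impractical, here is a minimal sketch that counts how many TF combinations there are to test. The TF names and the choice of 100 factors are made up for illustration; they are not figures from the paper.

```python
from itertools import combinations
from math import comb

def count_tf_combinations(n_tfs, max_size):
    """Number of non-empty TF sets of size <= max_size drawn from n_tfs factors."""
    return sum(comb(n_tfs, k) for k in range(1, max_size + 1))

# Hypothetical TF names, used only to show what the enumerated sets look like.
tfs = ["TF_A", "TF_B", "TF_C", "TF_D"]
pairs = list(combinations(tfs, 2))

print(len(pairs))                     # pairs among 4 factors: 6
print(count_tf_combinations(100, 3))  # singles, pairs and triples among 100 factors: 166750
```

Even when limited to sets of three, the count grows far faster than the number of factors, and this ignores order, spacing and orientation.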
5Problems with predicting gene regulation
Regulatory motif sequences have low consensus,
e.g. the well-known TATA box has a consensus
of TATA(A/T)A(A/T)(A/G).
Numerous transcription factors can bind to any
one motif.
Many genes have multiple known motifs upstream of
ATG.
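A degenerate consensus such as TATA(A/T)A(A/T)(A/G) stands for 2 x 2 x 2 = 8 distinct sequences. As a rough sketch, it can be written as a regular expression and scanned against a sequence. The test sequence below is invented, and `find_tata` is a name chosen for this example.

```python
import re

# TATA(A/T)A(A/T)(A/G) written as a character-class pattern.
TATA_CONSENSUS = "TATA[AT]A[AT][AG]"

def find_tata(seq):
    """Return 0-based start positions of every consensus match, overlaps included."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    return [m.start() for m in re.finditer(f"(?={TATA_CONSENSUS})", seq.upper())]

# Made-up sequence for illustration only.
seq = "GGCTATAAAAGGCCTATATATAGC"
print(find_tata(seq))  # [3, 14]
```

The two hits differ from each other (TATAAAAG and TATATATA), which is the low-consensus problem in miniature: an exact-string search for either one would miss the other.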
6Example of cis-regulatory logic
From Yuh et al (1998), Science 279, 1896-1902
7The Approach
1. Using microarray expression data, the authors
built clusters of genes with similar expression
patterns.
From brain expression data in Wen et al (1998),
PNAS 95, 334-339
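Grouping genes by expression pattern can be sketched with a correlation threshold. This is a simple greedy stand-in, not the clustering method used in the paper; the gene names, profiles and the 0.9 threshold are all invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def greedy_clusters(profiles, threshold=0.9):
    """Put each gene in the first cluster whose seed profile it correlates with."""
    clusters = []  # each cluster is a list of gene names; the first one is the seed
    for gene, profile in profiles.items():
        for members in clusters:
            if pearson(profile, profiles[members[0]]) >= threshold:
                members.append(gene)
                break
        else:
            clusters.append([gene])  # no cluster matched: start a new one
    return clusters

# Toy expression profiles over four conditions (hypothetical values).
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.1, 6.0, 8.2],
    "gene3": [4.0, 3.0, 2.0, 1.0],
    "gene4": [8.0, 6.1, 3.9, 2.0],
}
print(greedy_clusters(profiles))  # [['gene1', 'gene2'], ['gene3', 'gene4']]
```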
8The Approach, cont.
2. From groups of genes with similar expression
patterns, a search is undertaken for consensus
sequence motifs within 800bp upstream of ATG in
each cluster.
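The slide does not say which motif finder was used, so the following is only a crude stand-in: it ranks k-mers by how much more often they appear in a cluster's upstream regions than in background upstream regions. The sequences, `k=6` and the function names are assumptions made for this sketch; only the 800 bp window comes from the text.

```python
from collections import Counter

UPSTREAM_BP = 800  # window size taken from the text

def kmers_present(seq, k):
    """Set of distinct k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def enriched_kmers(cluster_seqs, background_seqs, k=6, top=3):
    """Rank k-mers by (fraction of cluster promoters with it) - (fraction of background)."""
    def fractions(seqs):
        counts = Counter()
        for s in seqs:
            # Each sequence is assumed to end at the ATG, so keep its last 800 bp.
            counts.update(kmers_present(s[-UPSTREAM_BP:], k))
        return {kmer: n / len(seqs) for kmer, n in counts.items()}

    fg, bg = fractions(cluster_seqs), fractions(background_seqs)
    scored = [(fg[kmer] - bg.get(kmer, 0.0), kmer) for kmer in fg]
    return sorted(scored, reverse=True)[:top]

# Toy upstream sequences (invented, far shorter than real promoters).
cluster = ["AAACGCGTAA", "TTACGCGTCC", "GGACGCGTTT"]
background = ["AAAAAAAAAA", "TTTTTTTTTT", "GGGGCCCCGG"]
print(enriched_kmers(cluster, background, top=1))  # [(1.0, 'ACGCGT')]
```

Real motif finders work with position weight matrices rather than exact k-mers, precisely because of the low consensus discussed earlier.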
9The Approach, cont
- 3. The authors built a Bayesian network using the
TF sequence motifs as parent nodes, and the
expression data as data values.
- This can be applied to a gene of interest by
identifying the upstream TF motifs for that gene,
and finding the model(s) that best fit the known
upstream TF motifs.
- If the expression data is within the parameters
predicted by the model, then there is a decent
chance that its associated gene regulatory
structure can be verified experimentally.
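The idea of motif nodes acting as parents of an expression variable can be sketched as a single conditional probability table. This toy version is an assumption-laden illustration, not the authors' model: the two motifs, the training examples and the pseudocount are all hypothetical.

```python
from collections import defaultdict

def fit_cpt(examples, pseudocount=1.0):
    """Estimate P(gene in expression pattern | motif presence) from labelled examples."""
    counts = defaultdict(lambda: [0.0, 0.0])  # parent state -> [not in pattern, in pattern]
    for parents, in_pattern in examples:
        counts[parents][int(in_pattern)] += 1
    # Laplace smoothing keeps rarely seen parent states away from 0 or 1.
    return {
        parents: (c[1] + pseudocount) / (c[0] + c[1] + 2 * pseudocount)
        for parents, c in counts.items()
    }

# Toy data: ((has_motif_A, has_motif_B), gene_shows_the_expression_pattern)
training = [
    ((True, True), True), ((True, True), True), ((True, True), True),
    ((True, False), False), ((True, False), True),
    ((False, True), False),
    ((False, False), False), ((False, False), False),
]
cpt = fit_cpt(training)
for parents, p in sorted(cpt.items(), reverse=True):
    print(parents, round(p, 2))
```

In this toy table neither motif alone is a good predictor, but the pair is, which is the kind of combinatorial rule the approach is meant to capture.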
10Two examples from yeast
Both clusters have at least 10 genes each, and
there is some confidence that genes with the same
upstream TFs will exhibit the same expression
pattern as these clusters.
11Constructing the models
Using expression data from 30 microarrays, the
authors identified 5547 genes with significant
expression levels in yeast, and this data was
used to construct 49 models of expression
patterns.
12Predictive accuracy
These 49 models were applied to five test sets of
expression data, using only the upstream 800 bp
region as input. They found that the expression
pattern was correctly predicted for 1898 genes
out of the test set(s) of 2587 genes. This
amounts to 73% accuracy (random would be 1/49, or
2%).
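The arithmetic behind these figures can be checked directly from the counts given on the slide (1898 correct out of 2587 genes, 49 candidate patterns). Treating chance as a uniform guess among the 49 patterns follows the slide's own 1/49 baseline.

```python
correct, total, n_models = 1898, 2587, 49

accuracy = correct / total   # fraction of genes assigned the right pattern
chance = 1 / n_models        # uniform random guess among the 49 patterns

print(f"accuracy: {accuracy:.1%}")              # about 73%
print(f"chance:   {chance:.1%}")                # about 2%
print(f"fold over chance: {accuracy / chance:.0f}x")
```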
13Application to C. elegans
Given the larger amount of regulatory sequence
in higher-order organisms, and the potential for
more complex regulation, the authors had low
expectations for applying this model to C.
elegans. Using 2000 bp of upstream sequence,
and microarray expression data including Hill
(2000), the authors were surprised to learn that
they could predict expression patterns for
roughly half of the genes in the C. elegans
dataset.
14An example from C. elegans
15Is it really so simple?
Gene regulation involves a complex combinatorial
dance of numerous factors aside from the presence
or absence of TF binding sites. The authors have
deliberately limited their scope to cis-acting
upstream factors, ignoring regulatory elements
in introns or downstream regions, as well as the
effects of operons, alternative splicing, histone
modifications, methylation, et cetera.
16Model constraints
- Several bits of information were found to be
significant factors in improving the predictive
accuracy of the models:
- Motif orientation ( <--- or ---> )
- Distance from the start codon
- The particular order of various TFs
- The presence of multiple copies of the same TF
- All of those factors were included in the model
as priors.
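As a rough sketch of how such features might be read off a promoter, the code below scans both strands for one motif and reports copy number, orientation and distance from the ATG. The motif, the sequence and the function names are hypothetical, and exact string matching is a simplification.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def motif_features(upstream, motif):
    """Describe each motif occurrence; upstream is assumed to end right before the ATG."""
    strands = [("+", motif)]
    if revcomp(motif) != motif:  # a palindromic motif has no distinct orientation
        strands.append(("-", revcomp(motif)))
    hits = []
    for strand, pattern in strands:
        start = upstream.find(pattern)
        while start != -1:
            # Distance is counted from the motif's first base to the ATG.
            hits.append({"strand": strand, "distance_to_atg": len(upstream) - start})
            start = upstream.find(pattern, start + 1)
    hits.sort(key=lambda h: h["distance_to_atg"])
    return {"copies": len(hits), "hits": hits}

# Invented 20 bp upstream region with one forward and one reverse copy.
upstream = "GGTGACTCAAAAGAGTCATT"
features = motif_features(upstream, "TGACTC")
print(features["copies"])  # 2
print(features["hits"])    # nearest to the ATG first
```

Running the same scan for several motifs and merging the hit lists by distance would give the order of the TF sites along the promoter.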
17Why is distance from the start codon significant?
From Harbison et al (2004), Nature 431, 99-104
18The number of copies of a TF binding site is
relevant..
From Molecular Biology of the Cell, 4th edition
19Motif combinatorics and predictive accuracy
Combinatoric models are more accurate than
single-TF models (unless a gene is under the
control of only one TF).
The order of various TFs is significant.
20Future directions..
Because of the sensitivity of the model(s), even
a very small amount of ambiguity can yield junk
results. For this reason, SAGE data is not
particularly suitable, as only unique SAGE tags
can be said to be unambiguous; this in turn
excludes all sorts of potentially useful
data. However, we could use the microarray-based
predictions to pick gene regulatory structures to
investigate.