Title: Analyzing Time Series Gene Expression Data
1Analyzing Time Series Gene Expression Data
Ziv Bar-Joseph
Center for Automated Learning and Discovery
Carnegie Mellon University
2Expression Experiments
Time series: multiple arrays at various temporal intervals
Static: snapshot of the activity in the cell
3Abundance of time series expression datasets
- Over 30 of the 170 papers perform time series experiments.
- A total of 220 time series datasets.
- More arrays used for time series than for static expression experiments.
5Unique features of time series expression
experiments
- Autocorrelation between successive points.
- Can identify complete set of acting genes.
- Allows causality to be inferred.
6Time Series Examples: Development
Development of fruit flies (Arbeitman et al., Science 2002)
7Time Series Examples (cont)
Function
Infectious diseases, response to external stimulus
Interactions and Systems
Transcription factor knockouts
8Time Series Examples: Systems
The cell cycle system in yeast (Simon et al., Cell 2001)
9Computational challenges
Computational
Biological
10Sampling Rates
- Non-uniform
- Differ between experiments
11Cell Cycle Datasets
12Networks
Pattern Recognition
Data Analysis
13Representing time series expression data
- We are capturing a continuous process with a few samples.
- We need a way to convert our samples for each gene to an expression profile.
- Some simple techniques:
- - Linear interpolation
- - Spline interpolation
- - Functional assignment
14Standard interpolation
If we have missing values and noise, linear interpolation will fail to reproduce an accurate representation.
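As a point of reference, a minimal sketch of linear interpolation on a non-uniformly sampled profile with a missing value. The function name, the use of `None` for a missing measurement, and the toy numbers are assumptions made for illustration only.

```python
from bisect import bisect_left

def linear_interpolate(times, values, query_times):
    """Linearly interpolate one gene's profile at the requested time points.

    Missing measurements are given as None and are skipped, so each line
    segment is drawn between the nearest observed neighbours.
    """
    observed = [(t, v) for t, v in zip(times, values) if v is not None]
    ts = [t for t, _ in observed]
    vs = [v for _, v in observed]
    result = []
    for q in query_times:
        if q <= ts[0]:
            result.append(vs[0])   # clamp before the first observation
        elif q >= ts[-1]:
            result.append(vs[-1])  # clamp after the last observation
        else:
            hi = bisect_left(ts, q)
            lo = hi - 1
            frac = (q - ts[lo]) / (ts[hi] - ts[lo])
            result.append(vs[lo] + frac * (vs[hi] - vs[lo]))
    return result

# Hypothetical profile: non-uniform sampling (minutes), value at t = 20 missing.
times = [0, 10, 20, 40, 70]
values = [0.0, 1.0, None, 3.0, 0.0]
profile = linear_interpolate(times, values, [5, 20, 55])
```

The value reported at t = 20 is simply read off the straight line between t = 10 and t = 40, which is why a missing point near a peak or a noisy neighbour distorts the reconstructed profile.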
15Splines
- Instead of linear interpolation, we can use splines (piecewise polynomials).
- Still, splines will overfit when faced with missing values and noise.
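A minimal sketch of one common choice, a natural cubic spline fitted independently to a single gene. The slides do not say which spline family is used, so the natural boundary condition and the function name are assumptions. Because the curve is forced through every observed point, any noise in the samples is reproduced exactly, which is the overfitting problem noted above.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal system for the quadratic coefficients c.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] = c[n] = 0 is the "natural" boundary condition.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the polynomial piece containing x (clamped to the end pieces).
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate

# Hypothetical profile sampled at non-uniform time points (minutes).
knot_times = [0, 10, 20, 40, 70]
knot_values = [0.0, 1.0, 2.5, 3.0, 0.0]
curve = natural_cubic_spline(knot_times, knot_values)
```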
16The power of co-expression
- We can modify our splines to take into account
the fact that many genes are co-expressed.
17Avoiding overfitting
- Require that, for each gene, γ ~ N(0, Γ_j)
- Add noise term
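One way to write the two requirements above as a single model is sketched below. The symbols did not survive extraction of the slide, so every name here is an assumption: Y_i is the vector of observed values for gene i, S is the matrix of spline basis functions evaluated at the sampled time points, μ_j is the mean coefficient vector of class j, γ_i is the gene-specific deviation, and ε_i is the added noise term.

```latex
Y_i = S\,(\mu_j + \gamma_i) + \epsilon_i, \qquad
\gamma_i \sim N(0, \Gamma_j), \qquad
\epsilon_i \sim N(0, \sigma^2 I)
```

Under this reading, constraining γ_i to a zero-mean Gaussian pulls each gene's curve toward the mean curve of its co-expressed class, so a gene with few or noisy samples borrows strength from the others instead of being fitted exactly.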
18Class Assignment
- In some cases the biological classes are known
in advance.
The algorithm can be modified and combined with a
Gaussian mixture algorithm to perform clustering
of the continuous representation of the
expression data.
19Missing values
20Interpolation
21Alignment
FKH1
- Difference in the timing of similar biological
processes
22Continuous Alignment
Using the estimated splines, we continuously
align two expression datasets by minimizing a
global error function
RECOMB 2002
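A minimal sketch of the idea, assuming the alignment is a linear time warp (one stretch and one shift shared by all genes) and that the global error is the mean squared difference between the continuous curves. A plain grid search stands in for whatever optimizer the original work uses; the function names and toy curves are hypothetical.

```python
import math

def alignment_error(ref_curve, other_curve, stretch, shift, t_start, t_end, steps=200):
    """Mean squared difference between ref(t) and other(stretch * t + shift)."""
    total = 0.0
    for i in range(steps + 1):
        t = t_start + (t_end - t_start) * i / steps
        diff = ref_curve(t) - other_curve(stretch * t + shift)
        total += diff * diff
    return total / (steps + 1)

def align(ref_curves, other_curves, stretches, shifts, t_start, t_end):
    """Grid-search one (stretch, shift) pair that minimizes the error summed over genes."""
    best = None
    for a in stretches:
        for b in shifts:
            err = sum(alignment_error(f, g, a, b, t_start, t_end)
                      for f, g in zip(ref_curves, other_curves))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best

# Toy data: in the second dataset every gene runs 1.5x slower and starts 0.5 later.
ref = [lambda t: math.exp(-(t - 3.0) ** 2), lambda t: math.exp(-(t - 5.0) ** 2)]
other = [lambda s, f=f: f((s - 0.5) / 1.5) for f in ref]
error, stretch, shift = align(ref, other,
                              [0.5, 1.0, 1.5, 2.0], [0.0, 0.25, 0.5, 0.75],
                              0.0, 8.0)
```

Because the error is computed on continuous curves, the two datasets do not need to share sampling times or sampling rates.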
23Identifying differentially expressed genes
Wild Type
Knockout
- Hard to perform manual comparison.
- Sampling rates and different timing prevent
direct comparison.
Zhu et al, Nature 2000
24Using Global Error to Determine Significance
Key idea: combine an individual noise model with a global error (the area between the curves) that correctly captures the temporal difference between the two profiles.
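The global error term can be sketched as a numerical integral of the absolute difference between the two continuous curves. This covers only the area computation; how the area is combined with the per-point noise model to obtain a significance value is not specified on the slide and is not attempted here. The curve definitions are toy assumptions.

```python
def area_between(curve_a, curve_b, t_start, t_end, steps=200):
    """Trapezoid-rule estimate of the area between two curves on [t_start, t_end]."""
    width = (t_end - t_start) / steps
    gaps = [abs(curve_a(t_start + i * width) - curve_b(t_start + i * width))
            for i in range(steps + 1)]
    return width * (sum(gaps) - 0.5 * (gaps[0] + gaps[-1]))

# Hypothetical wild-type and knockout curves for one gene.
wild_type = lambda t: t
knockout = lambda t: 1.0
area = area_between(wild_type, knockout, 0.0, 2.0)
```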
25Comparing the continuous representation
WT
Knockout
26Enrichment for the Cell Cycle Factors
27Overcoming population effects
Smc3 observed values
- Microarray experiments profile a population of cells.
- Initially cells are synchronized, but they lose their synchronization over time.
- Need to compensate for synchronization loss in order to recover single cell values.
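To illustrate why compensation is needed, here is a toy forward model of synchronization loss. It is not the correction method itself. It assumes each cell's timing drifts by a Gaussian offset whose standard deviation grows linearly with elapsed time; that growth law, the 60-minute period, and all names are assumptions.

```python
import math
from statistics import NormalDist

def population_average(single_cell, t, spread_per_minute, cells=201):
    """Average a single-cell profile over cells whose timing has drifted apart."""
    sigma = spread_per_minute * t
    if sigma == 0:
        return single_cell(t)  # perfectly synchronized population
    offsets = NormalDist(0.0, sigma)
    # Evenly spaced quantiles replace random draws so the result is deterministic.
    quantiles = [(k + 0.5) / cells for k in range(cells)]
    return sum(single_cell(t + offsets.inv_cdf(q)) for q in quantiles) / cells

period = 60.0
single_cell = lambda t: math.cos(2 * math.pi * t / period)
early_peak = population_average(single_cell, 0.0, 0.1)          # first cycle peak
late_peak = population_average(single_cell, 4 * period, 0.1)    # peak four cycles later
```

The single-cell value is 1 at both peaks, but the population average at the later peak is strongly damped, which is the kind of distortion the observed values would show.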
28Networks
Pattern Recognition
Individual Gene
29Pattern recognition and clustering
- Identifying relationships between genes based on expression profiles.
- Handling non-uniform sampling rates.
- Determining relationships between clusters.
30Time Shifted and Inverted Profiles
Qian et al Journal of Molecular Biology 2001
31Results
Simultaneous expression profile relationships
Inverted expression profile relationships
Time delayed expression profile relationships
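The three relationship types can be illustrated with a lagged Pearson correlation. This is a simplified stand-in, not the method of the cited paper. The function names, the maximum lag, and the toy profiles are assumptions, and constant profiles (zero variance) are not handled.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_relationship(profile_a, profile_b, max_lag=2):
    """Return (kind, lag, r) for the strongest correlation over the tested lags.

    A lag of k compares profile_a[t] with profile_b[t + k], i.e. profile_b
    follows profile_a by k samples.
    """
    best_lag, best_r = 0, 0.0
    for lag in range(max_lag + 1):
        r = pearson(profile_a[:len(profile_a) - lag], profile_b[lag:])
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    if best_r < 0:
        kind = "inverted"
    elif best_lag > 0:
        kind = "time-delayed"
    else:
        kind = "simultaneous"
    return kind, best_lag, best_r

gene_a = [0, 0, 1, 3, 1, 0, 0, 0]
delayed = [0, 0, 0, 1, 3, 1, 0, 0]   # gene_a shifted one sample later
inverted = [-x for x in gene_a]      # mirror image of gene_a
```

Calling `best_relationship(gene_a, delayed)` classifies the pair as time-delayed with a lag of one sample, while `best_relationship(gene_a, inverted)` classifies it as inverted.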
32Hierarchical clustering
- For n leaves there are n-1 internal nodes.
- Each flip of an internal node creates a new linear ordering.
- There are 2^(n-1) possible linear orderings of the leaves of the tree.
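The count of linear orderings can be checked by brute force on a small tree. In this sketch a leaf is any non-tuple value and an internal node is a (left, right) tuple; that representation and the gene names are assumptions for illustration.

```python
def orderings(tree):
    """Yield every leaf ordering reachable by flipping internal nodes."""
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for a in orderings(left):
        for b in orderings(right):
            yield a + b   # children in their given order
            yield b + a   # the same node flipped

# Five leaves, hence four internal nodes.
tree = ((("g1", "g2"), "g3"), ("g4", "g5"))
all_orderings = list(orderings(tree))
```

With five leaves the enumeration yields 16 distinct orderings, one for each combination of flipped or unflipped internal nodes. Enumeration is only feasible for tiny trees, since the count doubles with every added leaf.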
33Determine Relations Between Clusters
Optimal leaf ordering selects the ordering that
maximizes the sum of the similarities of adjacent
leaves in the clustering tree.
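A minimal dynamic-programming sketch of this objective. For every subtree it keeps the best score for each possible (leftmost leaf, rightmost leaf) pair and combines the two children in both orientations. It omits any speed-ups a production implementation would use, and the tree encoding (integers for leaves, tuples for internal nodes) and toy similarity matrix are assumptions.

```python
def optimal_ordering(tree, sim):
    """Return (score, order) maximizing the summed similarity of adjacent leaves.

    `tree` uses integers for leaves and (left, right) tuples for internal
    nodes; `sim[i][j]` is the similarity between leaves i and j.
    """
    def solve(node):
        # Map (leftmost leaf, rightmost leaf) -> best (score, order) for this subtree.
        if not isinstance(node, tuple):
            return {(node, node): (0.0, [node])}
        left, right = solve(node[0]), solve(node[1])
        table = {}
        for first, second in ((left, right), (right, left)):
            for (i, k), (s1, o1) in first.items():
                for (l, j), (s2, o2) in second.items():
                    # k and l are the two leaves that become adjacent at the join.
                    score = s1 + s2 + sim[k][l]
                    if (i, j) not in table or score > table[(i, j)][0]:
                        table[(i, j)] = (score, o1 + o2)
        return table

    return max(solve(tree).values(), key=lambda entry: entry[0])

# Toy example: similarity is the negative distance between 1-D positions.
positions = [1.0, 0.0, 3.0, 2.0]
sim = [[-abs(p - q) for q in positions] for p in positions]
score, order = optimal_ordering(((0, 1), (2, 3)), sim)
```

In this toy case the chosen ordering places the leaves in monotone order of their positions (or the mirror image, which scores the same), while the tree structure itself is unchanged.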
34Results Synthetic Data
Input
Hierarchical clustering
Optimal ordering
3524 cell cycle experiments
36Short Time Series
- 60 of the time series datasets are short (fewer than 7 time points).
- Over 40000 signals are measured, data is very noisy and experiments are compared across all time points.
- Most clustering algorithms will miss small sets and, in addition, cannot be used to compare datasets.
37Taking advantage of the small number of points
38Networks
Pattern Recognition
Individual Gene
39Systems Biology
- Different types of data provide partial information about the activity in the cell.
- By integrating these data sources we can obtain a better picture of the activity in the cell.
- A lot of current interest, though relatively few methods construct temporal models.
40Dynamic Bayesian Networks
- Bayesian networks are graphical models which can account for the stochasticity in the data.
- Can be extended to handle time series data (dynamic Bayesian networks).
- So far have been used for small scale modeling.
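A very small sketch of the time-series extension: in a dynamic Bayesian network a variable at time t+1 depends on variables at time t. The code estimates that dependency for one fixed regulator-to-target edge from binarized expression values. Structure learning, hidden variables, and continuous values are all left out, and the binarization, names, and toy data are assumptions.

```python
from collections import Counter

def transition_table(regulator, target):
    """Estimate P(target[t+1] = 1 | regulator[t]) from two binary time series."""
    counts = Counter()
    for t in range(len(regulator) - 1):
        counts[(regulator[t], target[t + 1])] += 1
    table = {}
    for state in (0, 1):
        total = counts[(state, 0)] + counts[(state, 1)]
        # Laplace smoothing keeps the estimate defined when a state is rare or unseen.
        table[state] = (counts[(state, 1)] + 1) / (total + 2)
    return table

# Toy data in which the target copies the regulator one time step later.
regulator = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
target = [0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
cpt = transition_table(regulator, target)
```

Here the estimated probability that the target is on is high when the regulator was on at the previous time point and low otherwise, which is the kind of temporal dependency such a model encodes.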
41Modeling tryptophan metabolism on E. coli
Ong et al Bioinformatics 2002
42Genetic RegulAtory Modules (GRAM)
- Gene Modules
- - Set of genes that are co-regulated and co-expressed.
- Functional Module
- - Collection of gene modules with related function.
43Assembly of the Cell Cycle Transcriptional Regulatory Network
Blue boxes: gene modules
We combine GRAM with our continuous alignment
algorithms to construct a dynamic model for a
sub-network
44Assembly of the Cell Cycle Transcriptional Regulatory Network
Blue boxes: gene modules
Individual regulators: ovals, connected to their modules
Dashed line: extends from the module encoding a regulator to the regulator protein oval
45Comparing the Continuous Representation
WT
Knockout
46Assembly of the Cell Cycle Transcriptional Regulatory Network
Blue boxes: gene modules
Individual regulators: ovals, connected to their modules
Dashed line: extends from the module encoding a regulator to the regulator protein oval
47Summary
- Time series expression data can be used to answer important biological questions.
- Pros: autocorrelation, allows for causal inference, provides a better view of cellular activity.
- Cons: large number of signals but small number of time points, noise, lack of repeats.
- By using methods specifically developed for this data we can overcome the above problems and take advantage of its unique properties.
48Want to know more?
- Z. Bar-Joseph, Analyzing time series gene expression data, Bioinformatics, in press.
- www.cs.cmu.edu/zivbj
- zivbj_at_cs.cmu.edu