Title: Finding Transcription Modules from large gene-expression data sets
1Finding Transcription Modules from large
gene-expression data sets
- Ned Wingreen, Molecular Biology
- Morten Kloster, Chao Tang, NEC Laboratories America
2Outline
- Introduction: transcription, regulation, gene chips, and transcription modules.
- Iterative Signature Algorithm (ISA).
- Advantages of Progressive Iterative Signature Algorithm (PISA).
- PISA applied to yeast data.
3Transcription regulation
http://doegenomestolife.org
4Gene chips
DNA microarray
5Gene-expression profile
$E_{gc}$, with genes $g = 1, 2, \ldots, N_g$ and conditions $c = 1, 2, \ldots, N_c$
But data very noisy
6Transcription module
[Figure: conditions C1-C3 and genes G1-G7]
A transcription module: a set of conditions and a set of genes connected by a transcription factor.
7Signature of a transcription module
A gene can be in multiple transcription modules.
8Iterative Signature Algorithm (ISA)
Barkai group (2002,2003)
[Figure: expression matrix over genes g1 ... gNG and conditions c1 ... cNC. Labels: Transcription Module (TM); Thresholding; Gene vector and condition vector]
Thresholding on both genes and conditions reduces
noise.
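The genes-to-conditions-to-genes iteration with thresholding on both sides can be sketched as below. This is a minimal illustration, not the Barkai group's implementation: the function names, the z-score thresholding rule, and the default thresholds `t_c` and `t_g` are assumptions, and a single matrix `E` stands in for the separately normalized matrices.

```python
from statistics import mean, pstdev

def threshold(scores, t):
    """Indices whose score lies more than t standard deviations above the mean."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return []
    return [i for i, s in enumerate(scores) if (s - mu) / sigma > t]

def isa_step(E, genes, t_c=1.0, t_g=1.0):
    """One ISA-style update: gene set -> conditions -> new gene set.

    E[g][c] is the expression of gene g in condition c (assumed normalized).
    """
    n_genes, n_conds = len(E), len(E[0])
    # Condition scores: average expression of the current gene set.
    cond_scores = [mean(E[g][c] for g in genes) for c in range(n_conds)]
    conds = threshold(cond_scores, t_c)
    if not conds:
        return [], []
    # Gene scores: average expression over the selected conditions.
    gene_scores = [mean(E[g][c] for c in conds) for g in range(n_genes)]
    return threshold(gene_scores, t_g), conds

def isa(E, genes, max_iter=50, t_c=1.0, t_g=1.0):
    """Iterate until the gene set stops changing (a fixed point) or empties."""
    genes, conds = sorted(genes), []
    for _ in range(max_iter):
        new_genes, conds = isa_step(E, genes, t_c, t_g)
        converged = new_genes == genes
        genes = new_genes
        if converged or not genes:
            break
    return genes, conds
```

Starting from a partial gene set, the iteration recruits the remaining genes that respond in the same conditions, which is the sense in which a module is a fixed point.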
9Limitations of ISA
- Lots of spurious modules (millions).
- Weak modules may be absorbed by strong ones.
- ISA does not make use of identified modules to
find new ones.
[Figure: expression matrix over genes g1 ... gNg and conditions c1 ... cNc]
10Progressive Iterative Signature Algorithm (PISA)
[Figure: expression matrix over genes g1 ... gNg and conditions c1 ... cNc]
11Advantages of PISA over ISA
- Removing found modules reveals hidden modules, and reduces noise for unrelated modules.
- No positive feedback.
- Improved thresholding for genes.
- Combines coregulated and counter-regulated genes.
12Example of PISA vs. ISA
[Figure: diagram with labels A, B, TF1, TF2, G1, G2]
13The gene-score threshold
Gene scores along the condition vector for some module
- Goal: fewer than one gene included in the module by mistake.
- Requires a threshold that is insensitive to the (unknown) module size.
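One way to meet both requirements is sketched below. It is an assumption, not the threshold used in PISA: the background spread is estimated from the median and the median absolute deviation (which a handful of high-scoring module genes barely affect), and the cutoff is set so that the expected number of background genes beyond it stays below one under a Gaussian background. The names `gene_score_threshold`, `select_genes`, and `expected_false` are hypothetical.

```python
from statistics import NormalDist, median

def gene_score_threshold(scores, expected_false=0.5):
    """Return (center, cutoff); genes with |score - center| > cutoff are kept.

    Assumes background gene scores are roughly Gaussian.
    """
    n = len(scores)
    center = median(scores)
    # Median absolute deviation, rescaled to estimate a Gaussian sigma.
    mad = median(abs(s - center) for s in scores)
    sigma = 1.4826 * mad
    # Two-sided tail: n * P(|z| > z_cut) = expected_false.
    z_cut = NormalDist().inv_cdf(1.0 - expected_false / (2.0 * n))
    return center, z_cut * sigma

def select_genes(scores, expected_false=0.5):
    """Indices of genes whose score deviates beyond the cutoff, in either direction."""
    center, cutoff = gene_score_threshold(scores, expected_false)
    return [i for i, s in enumerate(scores) if abs(s - center) > cutoff]
```

Because the spread comes from the bulk of the distribution rather than from the full variance, adding more module genes changes the cutoff only slightly.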
14Eliminating false modules
For scrambled data, preliminary modules either
have few genes or few contributing conditions.
True positives
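A scrambled control of this kind can be sketched as follows. The slide does not say how the data were scrambled or which cutoffs were used, so both the permutation scheme (shuffling each gene's values across conditions independently) and the `min_genes` / `min_conditions` parameters are assumptions for illustration.

```python
import random

def scramble(E, seed=0):
    """Shuffle each gene's values across conditions independently.

    Keeps every gene's value distribution but destroys co-expression.
    """
    rng = random.Random(seed)
    scrambled = []
    for row in E:
        row = list(row)
        rng.shuffle(row)
        scrambled.append(row)
    return scrambled

def passes_size_filter(n_genes, n_conditions, min_genes, min_conditions):
    """Keep a preliminary module only if it has enough genes AND enough conditions."""
    return n_genes >= min_genes and n_conditions >= min_conditions
```

Running the module search on `scramble(E)` shows which module sizes arise by chance, and the filter cutoffs can then be placed just above them.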
15PISA applied to yeast data
- Applied PISA to a dataset containing almost all available microarray data for S. cerevisiae: >6000 genes, 1000 conditions.
- Found 140 different modules, including all good modules found by ISA.
- Found some unknown modules.
- Found many good small modules that ISA could not find / separate from the spurious modules.
- 2600 genes in at least one module, 900 genes in more than one module.
16Some modules found by PISA
17Example: Zinc module
[Figure: genes shown: ZRT1, ZRT2, ZRT3, INO1, ZAP1, ADH4, YNL254C, YOL154W. Labels: "ZAP1-regulated genes during zinc starvation" (Lyons et al., PNAS 97, 7957-7962 (2000)); "Zinc module found by PISA"]
18Comparison with other databases
Gold standard: Gene Ontology (Genome Res. 11, 1425-1433 (2001))
Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))
19Correlations
[Figure: modules marked as correlated or anticorrelated; number of genes in parentheses]
- rRNA processing (117)
- Ribosomal proteins (126)
- Histone (19)
- Fatty acid syn (22)
- Cell cycle G2/M (31)
- Cell cycle M/G1 (35)
- Cell cycle G1/S (66)
- Mating genes for type a (15)
- Mating type a signaling genes (6)
- Mating (110)
- Mating factors/receptors a/a difference (26)
- Oxidative stress response (69)
- De novo purine biosyn (32)
- Lysine biosyn (11)
- Biotin syn transport (6)
- Arg biosyn (6)
- aa biosyn (96)
- aryl alcohol dehydrogenase (6)
- proteolysis (27)
- trehalose hexose metabolism/conversion (21)
- COS genes (11)
- heat shock (52)
- repair of disulfide bonds (26)
20Summary
- Data from gene chips can be used to identify transcription modules (TMs).
- Iterative approach (ISA) is promising.
- PISA improves on ISA by taking out found TMs.
- PISA also improves gene thresholding, avoids positive feedback, and improves signal to noise by grouping coregulated and counter-regulated genes.
- PISA very effective for finding secondary modules.
http://cn.arxiv.org/abs/q-bio/0311017
21Future Directions
- Input to experiment
  - new modules and new genes in old modules.
  - what kinds of experiments give the most informative data?
- Improve PISA
  - better pre/post-processing of data.
- Apply PISA to other organisms.
- Combine PISA with other data (experimental, bioinformatic) to systematically identify TMs, and reconstruct the transcription network.
22De novo purine biosynthesis
Number of genes: 32
Average number of contributing conditions: 14.6
Consistency: 0.59
Best ISA overlap: 0.59 at $t_G = 5.0$, frequency 16
23Galactose induced genes
Number of genes: 23
Average number of contributing conditions: 18.1
Consistency: 0.55
Best ISA overlap: 0.74 at $t_G = 3.2$, frequency 686
24Hexose transporters
Number of genes: 10
Average number of contributing conditions: 33.7
Consistency: 0.59
Best ISA overlap: 0.6 at $t_G = 3.8$, frequency 41
25Peroxide shock
Number of genes: 69
Average number of contributing conditions: 23.9
Consistency: 0.50
Best ISA overlap: 0.34 at $t_G = 3.4$, frequency (1)
26Implementation of PISA
- Normalization of gene-expression data
- Iterative algorithm to find preliminary modules (modified ISA)
  - avoiding positive feedback
  - gene-score threshold
- Orthogonalization
- Finding consistent modules
27Normalization of expression data
Gene-score matrix EG:
- normalizes total RNA levels
- removes reference-condition bias
- makes gene scores comparable
Condition-score matrix EC:
- makes condition scores comparable
28Iterative algorithm: modified ISA (mISA)
Start with a random set of genes GI.
Produce condition-score vector sC.
Produce gene-score vector sG, using
leave-one-out scoring to avoid positive
feedback.
From sG, calculate gene vector mG for next
iteration.
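A minimal sketch of one such update follows, under stated assumptions: condition scores are taken as weighted sums over the condition-score matrix, gene scores as dot products with the gene-score matrix, and "leave-one-out" is read as removing a gene's own contribution to the condition scores before that gene is scored. The names `misa_step` and `next_gene_vector` and the simple magnitude cutoff are hypothetical.

```python
def misa_step(E_G, E_C, m_G):
    """One modified-ISA update.

    E_G, E_C: genes x conditions matrices (lists of rows).
    m_G: dict mapping gene index -> weight for genes in the current set.
    Returns (s_C, s_G): condition-score and gene-score vectors.
    """
    n_g, n_c = len(E_G), len(E_G[0])
    # Condition scores from the current gene vector.
    s_C = [sum(w * E_C[g][c] for g, w in m_G.items()) for c in range(n_c)]
    s_G = []
    for g in range(n_g):
        w = m_G.get(g, 0.0)
        # Leave-one-out: drop gene g's own contribution before scoring it,
        # so a gene cannot keep itself in the module through its own noise.
        s_C_loo = [s_C[c] - w * E_C[g][c] for c in range(n_c)]
        s_G.append(sum(E_G[g][c] * s_C_loo[c] for c in range(n_c)))
    return s_C, s_G

def next_gene_vector(s_G, cutoff):
    """Gene vector for the next iteration: signed scores beyond the cutoff."""
    return {g: s for g, s in enumerate(s_G) if abs(s) > cutoff}
```

Keeping the sign in `next_gene_vector` lets coregulated and counter-regulated genes enter the same module with opposite weights.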
29Orthogonalization
After finding each converged preliminary module (sG, sC), remove the component along sC from the expression profiles of all genes.
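Removing the component along sC is an ordinary vector projection: each gene's expression profile $E_g$ is replaced by $E_g - \frac{E_g \cdot s_C}{s_C \cdot s_C}\, s_C$. A minimal sketch, with the function name `remove_component` chosen here for illustration:

```python
def remove_component(E, s_C):
    """Project the condition-score vector s_C out of every row (gene) of E."""
    norm2 = sum(x * x for x in s_C)
    if norm2 == 0:
        return [list(row) for row in E]
    out = []
    for row in E:
        # Coefficient of this gene's profile along s_C.
        coef = sum(r * s for r, s in zip(row, s_C)) / norm2
        out.append([r - coef * s for r, s in zip(row, s_C)])
    return out
```

After this step every gene's profile is orthogonal to sC, so the same module cannot be found again and its signal no longer masks weaker modules.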
30Why does scrambled data yield large modules?
Long tails of expression data lead to
single-condition modules.
31Finding consistent modules
- Repeat PISA runs many times (30).
- Tabulate preliminary modules.
- A preliminary module contributes to a module if
  - the preliminary module contains >50% of the genes in the module,
  - these genes constitute >20% of the preliminary module.
- A gene is included in a module if it appears in >50% of the contributing modules, always with the same gene-score sign.
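The two rules above can be written down directly. The sketch below assumes each preliminary module is stored as a dict from gene name to gene-score sign (+1 or -1). That data layout and the function names are assumptions; the 50% / 20% / 50% fractions come from the rules as stated.

```python
def contributing(module_genes, prelims, frac_module=0.5, frac_prelim=0.2):
    """Preliminary modules that contribute to a candidate module.

    module_genes: set of gene names.
    prelims: list of dicts mapping gene name -> gene-score sign (+1 or -1).
    """
    out = []
    for p in prelims:
        shared = module_genes & set(p)
        # Must cover >50% of the module, and the shared genes must make up
        # >20% of the preliminary module.
        if (len(shared) > frac_module * len(module_genes)
                and len(shared) > frac_prelim * len(p)):
            out.append(p)
    return out

def consensus_genes(contrib, frac=0.5):
    """Genes in more than `frac` of the contributing modules, always with one sign."""
    signs = {}
    for p in contrib:
        for gene, sign in p.items():
            signs.setdefault(gene, []).append(sign)
    return {gene for gene, ss in signs.items()
            if len(ss) > frac * len(contrib) and len(set(ss)) == 1}
```

Applying `contributing` and then `consensus_genes` repeatedly until the gene set stops changing is one way to resolve the circular definition, though the slide does not say how the authors did so.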
32Comparison with other databases
Gene Ontology (Genome Res. 11, 1425-1433 (2001))
- $N_g$: number of genes in organism
- $m$: number of genes in module
- $c$: number of genes in GO category
- $n$: number of genes in both module and GO category
p value
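A standard p value built from these four quantities is the hypergeometric tail: the probability that a random set of $m$ genes out of $N_g$ shares at least $n$ genes with a category of size $c$. The sketch below assumes that form, which the slide does not confirm; the function name is hypothetical.

```python
from math import comb

def overlap_p_value(N_g, m, c, n):
    """P(overlap >= n) for m genes drawn at random from N_g, c of them in the category."""
    total = comb(N_g, m)
    # Sum the hypergeometric probabilities for overlaps n, n+1, ..., min(m, c).
    tail = sum(comb(c, k) * comb(N_g - c, m - k)
               for k in range(n, min(m, c) + 1))
    return tail / total
```

For example, `overlap_p_value(10, 3, 4, 2)` is (6·6 + 4·1) / 120 = 1/3. Exact integer arithmetic avoids overflow even for genome-sized inputs.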
Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))
33Correlation of modules
[Figure: expression matrix over genes g1 ... gNg and conditions c1 ... cNc]