Title: Promoter Discovery: A Correlation Mining Approach
1Promoter Discovery: A Correlation Mining Approach
- Yi Lu
- Department of Computer Science
- Wayne State University
2Outline
- Introduction
- Related Work
- Problem Definition
- Correlation Mining
- Conclusion and Future work
3Introduction
- Central Dogma
- Gene Expression
4Introduction
- The promoter region (a set of transcription factor
binding sites) of a gene acts as a light switch:
it signals when to turn the gene on and off.
- We are interested in the relationship between the
promoter region and gene expression, i.e., which
binding sites determine whether a gene is
expressed or not?
5Introduction - Microarray
Microarray chips
Images scanned by laser
Gene        Value
D26528_at   193
D26561_at   70
D26579_at   318
D26598_at   1764
D26599_at   1537
D26600_at   1204
D28114_at   707
H29189_at   899
G29183_at   9210
Datasets (rows: genes; columns: D1, D2, D3, D4, ...; values not preserved)
Genes: D26528_at, D26561_at, D26579_at, D26598_at, D26599_at,
D26600_at, D28114_at, ...
6Introduction
- Transcription factor binding sites (motifs) in the
promoter region should explain changes in
transcription.
Time Course genes
AGCTAGCTGATTGTGCACACTGATCGAGCCCCACCATAGCTTCGTTGTG
CGCTATATATTGTGCAGCTAGTAGAGCTCTGCTAGAGCTCTATTTGTG
CCGATTGCGGGGCGTCTGAGCTCTTTGCTCTTTTGTGCCGCTTTTGAT
ATTATCTCTCTGCTCGTTTGTGCTTTATTGTGGGGGTTGTGCTGATTAT
GCTGCTCATAGGAGATTGTGCGAGAGTCGTCGTAGTTGTGCGTCGTCG
TGATGATGCTGCTGATCGATCGTTGTGCCTAGCTAGTAGATCGATGTT
TGTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCTTGTGCTCG
AGAGGAAGTATATATTTGTGCGCGCGCCGCGCGCACGTTGTGCAGCTGA
TGCATGCATGCTAGTATTGTGCCTAGTCAGCTGCGATCGACTCGTAGC
ATGCATCTTGTGCAGTCGATCGATGCTAGTTATTGTTGTGCGTAGTAG
TGCTTGTGCTCGTAGCTGTAG
R(t2)
7Related work
- Cluster gene expression profiles
- Search for motifs in promoter regions of
clustered genes
8Related work
- Clustering
- Partition the N genes into a set of disjoint groups
so that the expression profiles of genes in the same
group are highly similar to each other and the
expression profiles of genes in different groups
are dissimilar.
- Most widely used algorithms: K-means clustering and
hierarchical clustering.
- Genetic K-means algorithms (Lu et al. 2003,
2004).
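The partitioning idea above can be sketched with plain K-means. This is a minimal illustration only: it is not the genetic K-means variant cited on this slide, and the four expression profiles are invented values.

```python
import random

def kmeans(profiles, k, max_iters=100, seed=0):
    """Plain K-means on expression profiles (one list of floats per gene)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]  # random initial centers
    labels = None
    for _ in range(max_iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in profiles
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical expression profiles (4 genes x 3 time points).
profiles = [[1.0, 1.1, 0.9], [0.9, 1.0, 1.0], [5.0, 5.2, 4.9], [5.1, 4.8, 5.0]]
labels = kmeans(profiles, k=2)
print(labels)  # genes 0 and 1 share one label, genes 2 and 3 the other
```

Cluster numbering depends on the random initial centers, so only the grouping is meaningful, not the label values.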
9Related work
- Motif discovery after clustering
- Given a set of upstream sequences of co-expressed
genes, find subsequences that are overrepresented
and significant enough to be separated from other
subsequences.
- MEME, Gibbs Sampling, Winnower algorithms.
- PDC algorithm (Lu et al. 2006)
- Usually have a high false positive rate
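To make "overrepresented subsequences" concrete, here is a naive k-mer enrichment count. It is a deliberately simple stand-in, not MEME, Gibbs Sampling, Winnower, or PDC; the sequences, the `min_ratio` cutoff, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter

def kmer_gene_counts(seqs, k):
    """For every k-mer, count the sequences that contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts

def overrepresented(cluster_seqs, background_seqs, k, min_ratio=2.0):
    """Return k-mers whose cluster frequency is >= min_ratio times the background."""
    fg = kmer_gene_counts(cluster_seqs, k)
    bg = kmer_gene_counts(background_seqs, k)
    hits = {}
    for kmer, n in fg.items():
        fg_frac = n / len(cluster_seqs)
        # Add-one smoothing so k-mers absent from the background do not divide by zero.
        bg_frac = (bg[kmer] + 1) / (len(background_seqs) + 1)
        ratio = fg_frac / bg_frac
        if ratio >= min_ratio:
            hits[kmer] = round(ratio, 2)
    return hits

# Hypothetical upstream sequences: a co-expressed cluster and a background set.
cluster = ["ACGTTGTGCAA", "GGTTGTGCTTA", "CATTGTGCGGA"]
background = ["ACGATCGATCG", "GGCATCGATTA", "CAGGCTAGGGA", "TTACGGATCCA"]
hits = overrepresented(cluster, background, k=6)
best = max(hits, key=hits.get)
print(best, hits[best])
```

A raw ratio like this has no significance test behind it, which is one way the high false positive rate noted above can arise.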
10Motivation
- Research indicates that multiple transcription
factor binding sites are involved in each
transcription process. This leads us to study
modules (pairs of motifs) instead of individual
motifs.
11Motivation
- Not all genes that contain the same motif show the
same gene expression change.
- Not all genes with the same gene expression change
contain the same motif.
12Problem Definition
- Given a list of genes with the corresponding module
presence information and gene expression information,
find the relationship between modules and gene
expression, i.e., which modules or module
combinations may relate to a gene expression
change.
- M1 M2 -> increase in gene expression from
Day 1 to Day 4
13Method - Quantify Gene Expression
14Method - Quantify Gene Expression
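15Method - Generate Frequent Module Set
- Frequent module sets (occurrence >= 2)
- Single modules: M1 (4), M2 (3), M3 (2); M4 (1) is below the threshold
- Pairs: M1M2 (3), M1M3 (2); M2M3 (1) is below the threshold
- Triples: M1M2M3 (1) is below the threshold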
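The counts on this slide can be reproduced with a level-wise (Apriori-style) frequent itemset search. The gene-to-module table below is hypothetical, built only to match the slide's counts, and the threshold is read as "at least 2 occurrences" because M3, with 2 occurrences, still appears in later results.

```python
from itertools import combinations

def frequent_sets(transactions, min_support=2):
    """Apriori-style level-wise search; returns {itemset: support}."""
    items = sorted({i for t in transactions.values() for i in t})
    frequent = {}
    candidates = [frozenset([i]) for i in items]
    size = 1
    while candidates:
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions.values() if c <= t)
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        size += 1
        # Next-level candidates use only surviving items, and are kept only
        # if every (size-1)-subset was frequent at the current level.
        survivors = sorted({i for s in level for i in s})
        candidates = [
            frozenset(c) for c in combinations(survivors, size)
            if all(frozenset(sub) in level for sub in combinations(c, size - 1))
        ]
    return frequent

# Hypothetical gene -> modules table, chosen to match the counts on the slide.
genes = {
    "G1": {"M1", "M2", "M3"},
    "G2": {"M1", "M2", "M4"},
    "G3": {"M1", "M2"},
    "G4": {"M1", "M3"},
}
result = {"".join(sorted(s)): n for s, n in frequent_sets(genes).items()}
print(result)
```

The subset check prunes M1M2M3 without counting it, because M2M3 is already infrequent.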
16Method - Generate Frequent Gene Expression Set
- Frequent gene expression sets (occurrence >= 2)
- Single items: E1 (2), E1- (0), E2 (1), E2- (3), E3 (0), E3- (2)
- Pairs: E1E2- (1), E1E3- (1), E2-E3- (2)
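The slides do not preserve how expression is quantified, so the following is only one plausible reading: each interval between time points becomes an item, "Ei" for an increase and "Ei-" for a decrease, when the change exceeds a cutoff. The change values and the 0.5 cutoff are invented; they are chosen so that the single-item counts match this slide.

```python
from collections import Counter

def expression_items(changes, threshold=0.5):
    """Map per-interval expression changes to items: 'E1' (up), 'E1-' (down)."""
    items = set()
    for i, delta in enumerate(changes, start=1):
        if delta >= threshold:
            items.add(f"E{i}")
        elif delta <= -threshold:
            items.add(f"E{i}-")
    return items  # changes smaller than the threshold produce no item

# Hypothetical expression changes over three consecutive intervals.
changes = {
    "G1": [0.8, 1.1, 0.1],
    "G2": [1.2, -0.9, -1.5],
    "G3": [0.0, -1.3, -0.7],
    "G4": [0.2, -0.6, 0.3],
}
gene_items = {g: expression_items(c) for g, c in changes.items()}
support = dict(Counter(item for items in gene_items.values() for item in items))
print(sorted(support.items()))
```

Once genes carry item sets like these, the same frequent-set search used for modules applies unchanged.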
17Correlation Measure Contingency Table
- The relation between u and v in the pair (u,v)
18Liddell Measure
- Liddell = (2*1 - 1*0) / (2*2) = 0.5
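The arithmetic on this slide is consistent with the usual form of the Liddell measure on a 2x2 contingency table, (a*d - b*c) / ((a+c) * (b+d)). The cell layout below is an assumption, since the slide's table did not survive: a = genes with both u and v, b = u only, c = v only, d = neither.

```python
def liddell(a, b, c, d):
    """Liddell measure for a 2x2 contingency table.

    Assumed layout: a = u and v, b = u without v, c = v without u, d = neither.
    Equivalent to a/(a+c) - b/(b+d).
    """
    denom = (a + c) * (b + d)
    if denom == 0:
        raise ValueError("undefined: v occurs in all genes or in none")
    return (a * d - b * c) / denom

# Cell counts that match the slide's worked value: (2*1 - 1*0) / (2*2).
value = liddell(a=2, b=1, c=0, d=1)
print(value)  # 0.5
```

The measure ranges from -1 to 1, which fits the positive and negative thresholds used on the next slide.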
19Method - Correlate Module Set with Gene Expression Set
- Minimize module set
- Maximize gene expression set
- With the minimum Liddell value set to 0.5 / -0.5,
the resulting rules are:
- M2 -> E1
- M2 -> (E2- E3-)
- M3 -> E2- E3-
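A sketch of the correlation step under stated assumptions: score every frequent module set against every frequent expression set with the Liddell measure, keep pairs with an absolute score of at least 0.5, then keep the smallest module set and the largest expression set among redundant rules. The gene tables are hypothetical, built to match the counts on the earlier slides, and the pruning is one reading of "minimize" and "maximize", not necessarily the author's procedure.

```python
def liddell(a, b, c, d):
    """(a*d - b*c) / ((a+c) * (b+d)); returns 0.0 when undefined (an assumption)."""
    denom = (a + c) * (b + d)
    return (a * d - b * c) / denom if denom else 0.0

def score(ms, es, modules, expression):
    """Liddell score between a module set and an expression set over all genes."""
    a = b = c = d = 0
    for gene, present in modules.items():
        has_m, has_e = ms <= present, es <= expression[gene]
        if has_m and has_e:
            a += 1
        elif has_m:
            b += 1
        elif has_e:
            c += 1
        else:
            d += 1
    return liddell(a, b, c, d)

def correlate(modules, expression, module_sets, expression_sets, min_abs=0.5):
    rules = {}
    for ms in module_sets:
        for es in expression_sets:
            s = score(ms, es, modules, expression)
            if abs(s) >= min_abs:
                rules[(ms, es)] = s
    kept = {}
    for (ms, es), s in rules.items():
        # Drop a rule if a smaller module set or a larger expression set also qualifies.
        smaller_module = any(m2 < ms and e2 == es for m2, e2 in rules)
        larger_expr = any(m2 == ms and es < e2 for m2, e2 in rules)
        if not smaller_module and not larger_expr:
            kept[(ms, es)] = s
    return kept

# Hypothetical tables (4 genes), consistent with the counts on the earlier slides.
modules = {"G1": {"M1", "M2", "M3"}, "G2": {"M1", "M2", "M4"},
           "G3": {"M1", "M2"}, "G4": {"M1", "M3"}}
expression = {"G1": {"E1", "E2"}, "G2": {"E1", "E2-", "E3-"},
              "G3": {"E2-", "E3-"}, "G4": {"E2-"}}
fs = frozenset
module_sets = [fs({"M1"}), fs({"M2"}), fs({"M3"}), fs({"M1", "M2"}), fs({"M1", "M3"})]
expression_sets = [fs({"E1"}), fs({"E2-"}), fs({"E3-"}), fs({"E2-", "E3-"})]

kept = correlate(modules, expression, module_sets, expression_sets)
rules = {("".join(sorted(m)), "".join(sorted(e))): s for (m, e), s in kept.items()}
print(rules)
```

With these invented tables the three surviving pairs are the same module and expression sets listed on the slide; the M3 rule comes out as a negative association here, a sign the slides do not state.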
20Result on Spermatogenesis
- Spermatogenesis is the biological process of sperm
formation. Two gene expression datasets were
downloaded from GEO (Gene Expression Omnibus).
- The time course of one dataset covers days 0,
3, 6, 8, 10, 14, 18, 20, 30, 35, and 56; the
other covers days 1, 4, 8, 11, 14, 18, 21,
26, 29, and 60.
21System Workflow
- GEO: Gene Expression Omnibus
- DBTSS: DataBase of Transcriptional Start Sites
- TRANSFAC: the Transcription Factor database
- JASPAR: the high-quality transcription factor
binding profile database
22Conclusion
- Not only the same module combinations, but also
the same genes containing those module
combinations, were recovered from both
datasets.
- The promoters detected by our approach are
statistically more significant than those found in
randomly generated datasets.
- Some promoters found by our approach are
confirmed by the literature.
23Future work
- The concordance between the two gene expression
datasets downloaded from GEO is low; a new method
to reconcile the differences between the two datasets
is needed.
- The number of motifs found by different algorithms
is overwhelming; we may incorporate weight
matrices and gene ontology to identify the
significant ones.
24References
- Gene Expression Clustering
- Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng,
and Susan Brown, "FGKA: A Fast Genetic K-means
Clustering Algorithm", in Proceedings of the 19th
ACM Symposium on Applied Computing, Nicosia,
Cyprus, March 2004.
- Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng,
and Susan Brown, "Incremental Genetic K-means
Algorithm and its Application in Gene Expression
Data Analysis", BMC Bioinformatics, 5(172),
October 2004.
- Motif Discovery
- Yi Lu, Shiyong Lu, Farshad Fotouhi, Yan Sun, and
Zijiang Yang, "PDC: Pattern Discovery with
Confidence in DNA Sequences", in Proceedings
of the IASTED International Conference on
Advances in Computer Science and Technology (ACST
2006), Puerto Vallarta, Mexico, January 2006.
- Motif Extraction, Module Integration
- Adrian E. Platts, Yi Lu, and Stephen A. Krawetz,
"K-SPMM, an Online System for Data Mining
Regulatory Elements from Murine Spermatogenic
Promoter Sequences", presented at the 2006 Great
Lakes Mammalian Development Meeting, Toronto,
March 3-5, 2006.
- Yi Lu, Adrian E. Platts, Charles G. Ostermeier,
and Stephen A. Krawetz, "A Database of Murine
Spermatogenic Promoters Modules & Motifs",
submitted to BMC Bioinformatics.
- Correlation Mining
- Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram,
and Stephen Krawetz, "Correlation Mining to
Reveal the Regulation of Transcription Factor
Binding Site Modules", 4th Great Lake
Bioinformatics Retreat, Frankenmuth, Michigan,
August 2005.
- Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram,
and Stephen Krawetz, "Mining of Correlation
Between Transcription Binding Sites and Gene
Expression Profiles", in preparation.
26Acknowledgements
- Dr. Shiyong Lu
- Dr. Stephen Krawetz
- Mr. Adrian Platts
- Dr. Jeffrey Ram
- Dr. Youping Deng
27Questions?