Title: GECCO Report
1. GECCO Report: Some Issues about TFBSs
- Tak-Ming Chan
- August 2, 2007
2. GECCO Report
- Genetic and Evolutionary Computation Conference (GECCO) 2007
- Submissions totaled 577
- Total accepted as full papers: 266
- Accepted as posters (abstracts): 210
3. GECCO07
- Tutorials and Workshops (2 days)
  - Introductory: GA, GP, EDA, ...
  - Advanced: Representations, Fitness landscapes, Problem hardness, ...
  - Specialized: Open-source software, Bioinformatics, Complex Networks, ...
- Presentations (3 days)
  - 14 tracks: GA (109/43), GP (54/27), ES/EP (21/11), Real-World Applications (99/48), Biological Applications (25/10), ...
  - All sessions go simultaneously
4. GECCO07
- Keynote Event
  - Public Debate on Complexity and Evolution
  - with Richard Dawkins (The Selfish Gene), Lewis Wolpert and Steve Jones (Almost Like a Whale)
  - Different views from the biologists about evolution, constraints, complexity
  - Audio download: http://www.cs.ucl.ac.uk/staff/p.bentley/evodebate.html
5. Digest of some Presentations
- Real-World Applications
  - A robust GP solution for hedge fund stock selection [1]
  - Robustness for volatile and extreme scenarios
  - Fitness: mean profit over the volatility (std.) of the scenarios
- Genetic Algorithms
  - Simple diversity mechanisms analysis [2]
  - Different diversity mechanisms (in two (μ+1) EAs) suit different problems (two plateau functions)
  - Genotype diversity is better in a multimodal problem
  - Phenotype diversity is better in a needle-in-a-haystack problem
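The fitness described for the hedge fund GP solution [1] (mean profit over volatility) can be illustrated with a minimal sketch. It assumes the fitness is the plain ratio of the mean profit to the population standard deviation across scenarios; the function name, the zero-volatility fallback and the profit figures are hypothetical and are not taken from [1].

```python
from statistics import mean, pstdev

def robust_fitness(profits_per_scenario):
    """Mean profit divided by the standard deviation across scenarios."""
    volatility = pstdev(profits_per_scenario)
    if volatility == 0:
        # Identical profit in every scenario: the ratio is undefined,
        # so this sketch simply falls back to the mean (an arbitrary choice).
        return mean(profits_per_scenario)
    return mean(profits_per_scenario) / volatility

# Hypothetical profits of two candidate solutions in three scenarios.
steady = [0.04, 0.05, 0.06]     # low volatility
erratic = [-0.20, 0.05, 0.30]   # same mean profit, high volatility

print(robust_fitness(steady), robust_fitness(erratic))
```

Under this reading, two solutions with the same mean profit are ranked by how stable that profit is across scenarios, which is what rewards robustness.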
6. Digest of some Presentations
- Biological Applications
  - Prof. Congdon, the chair of my session, had a conversation with me
  - GAMI (consensus-led) addressed TFBS identification in a multimodal way (unintentionally)
  - In further work, they found Information Content was not as ideal as expected for their datasets
  - Check it out for more in
[1] W. Yan, C. D. Clack, Evolving Robust GP Solutions for Hedge Fund Stock Selection in Emerging Markets, Proceedings of GECCO 07, pp. 2234-2241
[2] T. Friedrich, N. Hebbinghaus, F. Neumann, Rigorous Analyses of Simple Diversity Mechanisms, Proceedings of GECCO 07, pp. 1219-1225
7. A little bit more about London
- Tired of Big Ben, Tower Bridge, Buckingham Palace?
- Try the Wellcome Collection near UCL
  - Especially good for seeing some interesting things about Medicine and Genetics
8. Some Issues about TFBSs
- Information Content (IC)
- Similarity measures between PWMs
[Figure: Transcription Factor, TFBS, Gene; the Transcription Factor regulates gene expression (transcription)]
9. Evaluation of IC as a Metric for TFBS Identification [3]
- Incorporated IC in GAMI, which originally employed Match Count (MC) (by Congdon et al.)
- Expected: IC should be more accurate than MC, but it turned out not to be
- IC missed some 100% conservation regions while MC did not
- Several possible problems of IC addressed
  - Background frequencies
  - Different IC scores for a motif and its reverse complement
  - Synonyms problem
[3] An Evaluation of Information Content as a Metric for the Inference of Putative Conserved Noncoding Regions in DNA Sequences Using a Genetic Algorithms Approach, in 2006 IEEE Transactions on Computational Biology and Bioinformatics
10. Scores for MC and IC
- Rough correlation, but different peaks
11. Background frequencies
- IC has a very strong preference for CG-rich regions for the dataset (SOX21)
- The background frequencies
- However, the areas of conservation are not always CG-rich
12. Background frequencies: IS an issue!!
- IC is not suitable for a strongly biased region
- Li Gang raised a similar question (a 2-letter case for simplification; background frequencies 0.8 and 0.2, natural log)
  - Conserved column (0.9, 0.1): log(0.9/0.8) × 0.9 + log(0.1/0.2) × 0.1 = 0.0367
  - Random column (0.5, 0.5): log(0.5/0.8) × 0.5 + log(0.5/0.2) × 0.5 = 0.2231
  - IC favors the random nucleotides over the conserved ones!
- Further questions raised (by Cyrus)
  - Is it possible that promoter regions are really biased in real-world problems?
  - In this paper, the sequences in the datasets tested are 8kb-10kb long; what about the promoter regions?
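The 2-letter example above can be checked with a short sketch. It assumes IC per column is the relative entropy between the column frequencies and the background, using the natural logarithm (which is what reproduces the slide's numbers); the function and variable names are illustrative only.

```python
import math

def column_ic(freqs, background):
    """Relative entropy: sum of f * ln(f / b) over symbols with f > 0."""
    return sum(f * math.log(f / b) for f, b in zip(freqs, background) if f > 0)

background = (0.8, 0.2)                         # strongly biased 2-letter background
conserved = column_ic((0.9, 0.1), background)   # column dominated by the common letter
random_like = column_ic((0.5, 0.5), background) # column with no preference at all

print(round(conserved, 4), round(random_like, 4))  # 0.0367 0.2231
```

The conserved column scores lower only because it resembles the biased background; the measure rewards deviation from the background, not conservation as such.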
13. Different IC scores for a motif and its reverse complement: MAY be an issue!!
- Highest scoring motif with IC
  - caggcaccactcactgcccc (207.92)
  - the reverse complement (C is less frequent than G in SOX21): ggggcagtgagtggtgcctg (189.73)
  - Both score 117 (of a possible 120 (20×6)) with MC
- Questions
  - Should the background sequences be recalculated for the reverse complement?
  - How are the motif instances aligned according to this motif (see the next issue of synonyms)?
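A small sketch shows why a motif and its reverse complement can receive different IC scores when C and G have different background frequencies. The background values below are hypothetical (chosen only so that C is rarer than G), and each position is treated as a fully conserved column, so the numbers do not reproduce the 207.92 and 189.73 reported on the slide.

```python
import math

COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(motif):
    """Complement every base, then reverse the string."""
    return motif.translate(COMPLEMENT)[::-1]

def conserved_ic_sum(motif, background):
    """Sum of ln(1 / b) per base: the relative entropy of fully conserved columns."""
    return sum(-math.log(background[base]) for base in motif)

# Hypothetical background in which C is rarer than G.
background = {"a": 0.25, "c": 0.20, "g": 0.30, "t": 0.25}

motif = "caggcaccactcactgcccc"
rc = reverse_complement(motif)
forward_score = conserved_ic_sum(motif, background)
reverse_score = conserved_ic_sum(rc, background)

print(rc)
print(forward_score > reverse_score)
```

The C-rich strand scores higher because rare bases contribute more per position, while a strand-symmetric measure such as MC gives both strands the same score.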
14. Synonyms problem
- Different motifs correspond to the same IC score
- Have a look at how the motif instances are chosen and aligned for IC in this work
[Figure: high MC motif vs. lower-quality synonym]
15. Synonyms problem
- The procedure
  - First, GAMI (consensus-led) is performed to locate the best motifs based on MC
  - One motif may correspond to several instances in the sequences (they may together contribute to the same MC)
  - e.g. ATCGATCG: ATCGATGG or ATCGAACG
  - When IC is tested, it is calculated based only on the combinations of these instances (up to 1000 combinations)
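The procedure above can be sketched as follows. The sketch assumes MC is the sum, over sequences, of the best number of matching positions between the consensus motif and any window of that sequence; the exact definition used by GAMI is not given on the slide, and the two sequences are hypothetical.

```python
from itertools import islice, product

def matches(a, b):
    """Number of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))

def best_instances(motif, sequence):
    """Return (best match count, all windows achieving it) for one sequence."""
    w = len(motif)
    windows = [sequence[i:i + w] for i in range(len(sequence) - w + 1)]
    best = max(matches(motif, win) for win in windows)
    return best, [win for win in windows if matches(motif, win) == best]

def match_count(motif, sequences):
    """Assumed MC: sum of the best per-sequence match counts."""
    return sum(best_instances(motif, s)[0] for s in sequences)

motif = "ATCGATCG"
sequences = ["TTATCGATGGTT", "ATCGAACGCCATCGATGG"]  # hypothetical data

mc = match_count(motif, sequences)
# One instance per sequence, capped at 1000 combinations as on the slide.
tied = [best_instances(motif, s)[1] for s in sequences]
combos = list(islice(product(*tied), 1000))
print(mc, combos)
```

The second sequence contains two different instances with the same match count, so the same MC is backed by more than one instance combination, and each combination would yield its own IC.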
16. Synonyms problem: NOT an issue!!
- The result is that the combination of motif instances may correspond to another consensus but is forced to align with the specific motif
- The representational power of consensus
- The inappropriate use of IC: it is used for position-led representations!
17. Future work concerning IC
- Some of the above issues have to be worked on
  - Background frequencies: how to estimate?
  - Forward and backward strands
  - Positional background frequencies
- Pseudo-counts
  - Large pseudo-counts have a pronounced effect on IC, especially for small sample sizes
  - Some work has been proposed to estimate appropriate pseudo-counts
- More
  - Context information of the instances
  - Additional information
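The pseudo-count effect mentioned above can be seen in a small sketch. It assumes one common scheme, in which a total pseudo-count is distributed over the bases in proportion to the background; the scheme, the uniform background and the counts are assumptions for illustration, not taken from the work discussed.

```python
import math

def column_ic(counts, background, pseudo):
    """IC of one column after adding `pseudo` counts split by the background."""
    total = sum(counts) + pseudo
    ic = 0.0
    for n, b in zip(counts, background):
        f = (n + pseudo * b) / total
        if f > 0:
            ic += f * math.log(f / b)
    return ic

uniform = (0.25, 0.25, 0.25, 0.25)
small = (5, 0, 0, 0)      # fully conserved column from 5 instances
large = (500, 0, 0, 0)    # fully conserved column from 500 instances

for pseudo in (0.0, 1.0, 4.0):
    print(pseudo,
          round(column_ic(small, uniform, pseudo), 3),
          round(column_ic(large, uniform, pseudo), 3))
```

With no pseudo-count both columns have the same IC; as the pseudo-count grows, the IC of the 5-instance column drops far more than that of the 500-instance column.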
18. Similarity measures between PWMs [4]
- Useful in EC for maintaining diversity
- For position-led representation
  - Position frequency matrix (PFM)
  - Position weight matrix (PWM)
19. D for PFMs
D (stands for the distance) is incremented by 1
20. For PFMs with different widths
- Various shifts are tried to get the minimal D (note that overlap is at least 6 nucleotides)
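The shift search for PFMs of different widths can be sketched as below. The slide does not say when a column pair increments D, so `columns_differ` is a hypothetical stand-in (a threshold on the total variation distance between normalized columns), not the criterion of [4]. The sketch also assumes that only overlapping columns are compared.

```python
def columns_differ(col_a, col_b, threshold=0.3):
    """Hypothetical criterion: total variation distance above a threshold."""
    fa = [x / sum(col_a) for x in col_a]
    fb = [x / sum(col_b) for x in col_b]
    return 0.5 * sum(abs(x - y) for x, y in zip(fa, fb)) > threshold

def distance_at_shift(pfm_a, pfm_b, shift):
    """Return (D, overlap) when pfm_b starts `shift` columns into pfm_a."""
    d = overlap = 0
    for j, col_b in enumerate(pfm_b):
        i = j + shift
        if 0 <= i < len(pfm_a):
            overlap += 1
            if columns_differ(pfm_a[i], col_b):
                d += 1  # D is incremented by 1 for each differing column pair
    return d, overlap

def min_distance(pfm_a, pfm_b, min_overlap=6):
    """Minimal D over all shifts whose overlap is at least `min_overlap`."""
    best = None
    for shift in range(-len(pfm_b) + 1, len(pfm_a)):
        d, overlap = distance_at_shift(pfm_a, pfm_b, shift)
        if overlap >= min_overlap and (best is None or d < best):
            best = d
    return best

# Hypothetical PFMs: each column holds counts for A, C, G, T.
pfm_a = [(10, 0, 0, 0), (0, 10, 0, 0), (0, 0, 10, 0), (0, 0, 0, 10)] * 2
pfm_b = pfm_a[1:7]   # a 6-column sub-matrix of pfm_a

print(distance_at_shift(pfm_a, pfm_b, 0), min_distance(pfm_a, pfm_b))
```

Without shifting, every column pair differs; trying all admissible shifts finds the alignment at which the narrower matrix coincides with part of the wider one.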
21. Further issues
- To adopt the D for PFMs in our work
  - Change the actual counts to normalized frequencies for the matrix used in our GA
- C (Correlation Coefficient) for PWMs
  - Use a random DNA sequence to compute the sum of weights of the PWMs and then use C to measure the scores (skipped, not so practical)
- The paper [4] did more than proposing D and C
  - Beyond this presentation
[4] Measuring similarities between transcription factor binding sites, Sep 2005, BMC Bioinformatics (IF 3.62)
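Both ideas on this slide can be sketched together: normalizing counts to frequencies, and comparing two PWMs by the correlation of their scores along a random DNA sequence. The log-odds weighting, the uniform background, the `eps` smoothing, the sequence length and the matrices are all assumptions for illustration; [4] may define the weights and C differently.

```python
import math
import random

BASES = "ACGT"

def normalize(pfm_counts):
    """Turn per-column counts into per-column frequencies."""
    return [[n / sum(col) for n in col] for col in pfm_counts]

def to_pwm(freqs, background=0.25, eps=1e-3):
    """Log-odds weights; `eps` avoids log(0) (an assumption of this sketch)."""
    return [[math.log((f + eps) / background) for f in col] for col in freqs]

def score_profile(pwm, seq):
    """Sum of weights of the PWM at every window of the sequence."""
    w = len(pwm)
    return [sum(pwm[k][BASES.index(seq[i + k])] for k in range(w))
            for i in range(len(seq) - w + 1)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical count matrices (columns hold counts for A, C, G, T).
counts_a = [[8, 1, 1, 0], [0, 9, 1, 0], [1, 0, 9, 0], [0, 0, 1, 9]]
counts_b = [[3 * n for n in col] for col in counts_a]  # same motif, more instances
counts_c = [[0, 1, 1, 8], [9, 0, 1, 0], [0, 9, 0, 1], [1, 0, 9, 0]]

rng = random.Random(0)
seq = "".join(rng.choice(BASES) for _ in range(500))
profiles = [score_profile(to_pwm(normalize(c)), seq)
            for c in (counts_a, counts_b, counts_c)]
c_ab = pearson(profiles[0], profiles[1])
c_ac = pearson(profiles[0], profiles[2])
print(c_ab, c_ac)
```

After normalization, the matrix built from three times as many instances is indistinguishable from the original, so the two score profiles correlate perfectly, while the unrelated matrix gives a lower C.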
22. The End
- Thank you very much!
- Q and A?
  - Conference Report
  - Issues for TFBSs
My footprint at Wellcome Collection: Health is most important