Title: Phylogenetic footprinting
1Phylogenetic footprinting
Topics in Computational Biology
Ilkka Vaahtoranta, 4.3.2004
2Phylogenetic footprinting
- Introduction
- Methods used
- Substring Parsimony Problem (Torsten)
- Results
3Introduction
4The Problem
- A major challenge of current genomics is to understand how gene expression is regulated.
- An important step towards this understanding is the ability to identify regulatory elements.
5In a Nutshell
- Phylogenetic footprinting is a method for the
discovery of regulatory elements in a set of
orthologous regulatory regions from multiple
species. It does so by identifying the best
conserved motifs in those orthologous regions.
- The idea of phylogenetic footprinting was first proposed as early as 1988 (Tagle, Koop, Goodman...)
- At that time it was a little ahead of its time
- Only a few sequences from related species were available
6Orthologous vs. Analogous
- Orthologous sequences have the same function in different species and are evolutionarily related
- Analogous sequences have the same kind of function but are not related
- Phylogenetic footprinting uses orthologous sequences
7Regulatory Elements
[Figure: a promoter sequence containing REs, located at the 5' end of a gene composed of exons and introns]
Promoter
- Usually lies before (upstream of) the actual gene
- Rarely after the gene
- Approximately 600-1000 bp long
- Holds regulatory elements
Regulatory elements
- Relatively short sequences, from 5 to 25 bp long
- May contain gaps
- Appear in otherwise non-functional sequence
8Multiple genes in single species
- Single species
- Related genes
- This technique is used to find common regulatory factors
- Only within the given organism
- REs of a single gene are not found
- This is not phylogenetic footprinting
9Multiple species with orthologous regulatory regions
What do we need to identify regulatory elements?
- A set of orthologous non-coding DNA sequences from related species
- For example, one might use the non-coding sequence of the insulin gene in ten different vertebrates
- If a region is well conserved, it is a possible RE
- This is phylogenetic footprinting
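As a toy illustration of the idea on this slide, the following sketch looks for substrings of length k that occur unchanged in every sequence of a set. The sequences and the function names (`kmers`, `shared_kmers`) are made up for illustration. Real phylogenetic footprinting tolerates mutations and uses the phylogenetic tree of the species, which this sketch deliberately ignores.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs, k):
    """Return the k-mers that occur, without any mutation, in every sequence."""
    common = kmers(seqs[0], k)
    for s in seqs[1:]:
        common &= kmers(s, k)
    return common


# Hypothetical upstream regions; the block TATAAA is planted in each of them.
upstream = [
    "GGCTTATAAAGCCA",
    "ACGTATAAATTGCG",
    "TTTATAAACCGGAC",
]
print(sorted(shared_kmers(upstream, 6)))
```

A block that survives in all species is a candidate RE. Exact matching is far too strict for diverged species, which is one reason the methods on the following slides score mutations instead.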
10Why examine non-coding sequences?
- Functional sequences evolve at a slower rate than non-functional sequences because of selective pressure
- A mutation in a coding sequence (gene) may change the whole function of the encoded protein
- A mutation in a regulatory element (RE) may only change the expression level of a gene
11Phylogenetic footprinting exploits the difference in mutation rate between functional and non-functional sequences
12Methods used
13Global Multiple Alignment
- ClustalW, a global multiple alignment (GMA) tool
- Drawbacks of global multiple alignment
- It is an NP-hard problem.
- Even if an optimal multiple alignment could identify all REs, we could not compute it.
- Because REs are quite short (about 10 out of 1000 nucleotides), the noise of the diverged non-functional sequence will overwhelm the short conserved signal.
14Classical motif finding
- Standard motif finding (MEME, AlignAce, ANN-Spec...)
- Segment-based motif finding (Dialign...)
- Outperform global multiple alignment
All have an important shortcoming
- They do not take phylogenetic relationships into account
- Closely related sequences get too much weight
15Substring Parsimony Problem, a motif finding algorithm
- A formalization of the phylogenetic footprinting (PF) idea
- Also an NP-hard problem, but easy to tune to eliminate exponential behavior
- Substring parsimony searches for the best alignments in a given set of sequences.
- The difference between substring parsimony and multiple alignment lies in the given phylogenetic tree.
- Multiple alignment does not take the relationships of the given species into account. This leads to a situation where closely related sequences in the set get a relatively high weight in the solution.
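The bullets above can be made concrete with a small sketch. It assumes the usual formulation of the problem, which these slides do not spell out: pick one substring of length k from each leaf sequence so that the parsimony score on the given tree (the number of substitutions along its branches) is as small as possible. The tree, the sequences, and all names (`substring_parsimony`, `cost_table`) are hypothetical. The table over all 4^k strings is only practical for small k, and this is not FootPrinter's actual implementation.

```python
from itertools import product

INF = float("inf")


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def substring_parsimony(tree, root, leaf_seqs, k):
    """Return (best score, a best root label) over one k-mer per leaf.

    tree      maps each internal node to the list of its children
    leaf_seqs maps each leaf to its sequence (assumed to have length >= k)
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def cost_table(node):
        # table[s] = cheapest score of the subtree below node if node is labeled s
        if node in leaf_seqs:
            seq = leaf_seqs[node]
            present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
            return {s: (0 if s in present else INF) for s in all_kmers}
        table = dict.fromkeys(all_kmers, 0)
        for child in tree[node]:
            child_table = cost_table(child)
            # Labels with infinite cost can never be the minimum, so skip them.
            finite = [(t, c) for t, c in child_table.items() if c < INF]
            for s in all_kmers:
                table[s] += min(c + hamming(s, t) for t, c in finite)
        return table

    root_table = cost_table(root)
    best = min(root_table, key=root_table.get)
    return root_table[best], best


# Hypothetical example: two closely related rodents and a more distant human.
tree = {"root": ["rodents", "human"], "rodents": ["mouse", "rat"]}
leaf_seqs = {"mouse": "ACGTAGG", "rat": "TTCGTAC", "human": "GGCATAA"}
score, label = substring_parsimony(tree, "root", leaf_seqs, 4)
print(score, label)
```

In this example mouse and rat share CGTA and the closest human substring, CATA, differs from it in one position, so the best score is 1. Because mutations are counted along tree branches, the two rodents contribute through their common ancestor rather than as two independent votes.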
16Substring Parsimony Problem
Your turn Torsten...
17Results
18The Footprinter
- Available for free at http://bio.cs.washington.edu/software.html
- Uses the substring parsimony method to identify possible motifs
- Is under constant development
- Example data at http://www.soe.ucsc.edu/blanchem/gh1/some_gh1.fa.main.html
19Algorithm performance
- FootPrinter program
- Data set: c-myc proto-oncogene upstream sequences
- k = 12
- d = 3
- n = 10
- L varies between 450 and 900 nucleotides
- Solution: 3 distinct conserved substrings
- Computer: P3 550 MHz, 512 MB RAM
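To get a feel for the sizes involved, the arithmetic below assumes that k is the motif length and L the sequence length. These readings of the symbols are an assumption, since the slide does not define them. The helper names `num_kmers` and `num_windows` are made up for illustration.

```python
def num_kmers(k, alphabet_size=4):
    """Number of distinct strings of length k over the DNA alphabet."""
    return alphabet_size ** k


def num_windows(seq_len, k):
    """Number of length-k substrings (windows) in a sequence of length seq_len."""
    return max(seq_len - k + 1, 0)


k = 12
print(num_kmers(k))                              # possible motifs of length 12
print(num_windows(450, k), num_windows(900, k))  # windows per sequence
```

With k = 12 there are 4^12 = 16,777,216 possible strings, while each sequence contains only a few hundred windows. This gap suggests why a practical implementation has to avoid enumerating every possible string.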
20Algorithm performance
21FP, Example results 1
- Metallothionein Gene family
- Promoter sequences are available for a wide variety of species
- REs have been experimentally determined in several species
- 4 major isoforms
- FootPrinter
- 590 bp upstream
- k = 7, 8, 9, 10
- 12 highly conserved regions, of which 4 have been confirmed
- The REs found are present in most of the isoforms
23FP, Example results 2
- Insulin gene family
- Two rodents and a pig (two gene copies each)
- Motifs with 0 mutations: k = 8
- Motifs with 1 mutation: k = 9, 10
- FootPrinter
- Found 4 verified motifs
- Many were missed because they contained too many mutations
- Searching for motifs of length 12 and 15 did not improve the results
24FP, insulin gene family
- Why were many known binding sites missed?
- Five categories
- No matches in other species
- Conserved regions are shorter than the motif length searched for
- Insertions and deletions not allowed
- Motifs do not meet statistical thresholds
- Motifs with internal variable sequence
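The category "insertions and deletions not allowed" can be illustrated with a short sketch. It compares a substitution-only count (Hamming distance) with an edit distance that also permits insertions and deletions. The sequences are invented and the functions are generic textbook versions, not code from FootPrinter.

```python
def hamming(a, b):
    """Mismatches between two equal-length strings (substitutions only)."""
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


site = "ACGTACGT"             # hypothetical binding site
with_insertion = "ACGGTACGT"  # same site with one extra G inserted

# A fixed-length window over the mutated site is shifted out of register.
print(hamming(site, with_insertion[:len(site)]))
print(edit_distance(site, with_insertion))
```

A single inserted base costs 1 under edit distance but shows up as 5 mismatches in the fixed-length window, so a substitution-only search with a small mutation threshold would miss this site.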
25Other computational methods
- ClustalW
- Tree-based global multiple alignment tool
- Good results with closely related sequences
- Bad results if sequences are diverged
- As fast as FootPrinter on this test set
- Dialign
- Segment-based multiple alignment tool
- Good results because it searches for short conserved regions
- Same set of found motifs as FootPrinter on this test set
- 10 times slower than FootPrinter on large datasets
26Other computational methods
- MEME
- Motif finding tool
- Searches for motifs with high information content
- Motifs may appear in a different order in the sequences
- Approximately the same set of found motifs as FootPrinter
- As fast as FootPrinter on this test set
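"Information content" can be made concrete with the standard per-column formula for DNA, 2 minus the Shannon entropy in bits, assuming a uniform background and no small-sample correction. The sketch below only illustrates that quantity; it is not MEME's actual scoring or search procedure, and the function name and example sites are made up.

```python
from collections import Counter
from math import log2


def information_content(sites):
    """Sum over columns of (2 - entropy) in bits for equal-length DNA sites."""
    total = 0.0
    for column in zip(*sites):
        n = len(column)
        counts = Counter(column)
        entropy = -sum((c / n) * log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


# Hypothetical aligned occurrences of one motif.
conserved = ["ACGT", "ACGT", "ACGT", "ACGT"]
variable = ["AA", "CA", "GA", "TA"]
print(information_content(conserved))
print(information_content(variable))
```

A perfectly conserved column contributes 2 bits and a column where all four bases are equally frequent contributes 0 bits, so the four identical sites score 8 bits and the second example scores 2 bits.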
27Future work
- Lack of sequences
- Inaccuracies in phylogenetic tree
- Need to know more about what to expect
- More filtering accuracy
- How unusually well conserved the region is
- Predicting pairs and triplets
28Questions