Title: Finding regulatory modules: A statistical approach
1. Finding regulatory modules: A statistical approach
- Mikhail Velikanov
- Linnaeus Centre for Bioinformatics
2. Introduction
- Regulatory modules (RMs): sets of regulatory sites that work cooperatively
  - TF binding sites and promoter elements
  - Splicing enhancers and suppressors
- Site clusters
  - site A AND (site B OR site C) AND (NOT site D)
- Beads on a spring
  - site B is 20 ± 3 bp downstream of site A
- Distance distributions have a short range and a well-defined peak
3. Searching for RMs: Setup of the Problem
[Figure labels: Motifs, Annotations]
- Seq. length constant and small (0.5 kb)
- Num. of sites 20
- No overlapping sites
- Sites characterized by
  - Identity
  - P-values (bounded by a threshold $p_t$)
    - shown by width
Look for annotation patterns that occur
consistently in all or some of the sequences.
4. RMs as Annotation Alignments
- Align sites by identity
- Find sequences of 2 or more sites shared across a number of annotations (common annotations)
- Conditions
  (1) Distances between sites are similar
  (2) P-values of aligned sites are similar
  (3) P-values of aligned sites are small
Need a function that measures how well conditions
(1-3) are satisfied (strength of common
annotation).
5. Strength of common annotation: site p-values
- Assume a common annotation of S sites supported by N sequences
- For the i-th site, let $p_i^{\min}$ and $p_i^{\max}$ be the smallest and the largest of the N p-values
  - $p_i^{\max}$: measure of how small the p-values are
  - $R_i = p_i^{\max} / p_i^{\min}$: measure of similarity
- Probability $p_i$ of observing p-values as similar and as small in N random annotations
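The slides do not give the formula for this probability. As a minimal sketch, assume (this is an assumption, not something stated in the talk) that the p-values of random sites are independent and uniform on $(0, p_t]$. Under that assumption, the probability that all N p-values are at most $p^{\max}$ and that their max/min ratio is at most $R$ is $(p^{\max}/p_t)^N (1 - 1/R)^{N-1}$. The function names below are hypothetical, and the Monte Carlo routine is only there to check the closed form.

```python
import random

def similar_small_prob(p_min, p_max, n, p_t=1.0):
    """Closed form under the uniform assumption:
    P(all n p-values <= p_max and max/min <= p_max/p_min)."""
    if not (0 < p_min <= p_max <= p_t):
        raise ValueError("need 0 < p_min <= p_max <= p_t")
    ratio = p_max / p_min  # R_i from the slide
    return (p_max / p_t) ** n * (1.0 - 1.0 / ratio) ** (n - 1)

def monte_carlo(p_min, p_max, n, p_t=1.0, trials=200_000, seed=0):
    """Simulation of the same event, used only as a sanity check."""
    rng = random.Random(seed)
    ratio = p_max / p_min
    hits = 0
    for _ in range(trials):
        # 1 - random() lies in (0, 1], so a draw is never exactly zero
        draws = [p_t * (1.0 - rng.random()) for _ in range(n)]
        hi, lo = max(draws), min(draws)
        if hi <= p_max and hi <= ratio * lo:
            hits += 1
    return hits / trials

exact = similar_small_prob(0.2, 0.6, 3)
estimate = monte_carlo(0.2, 0.6, 3)
```

The closed form follows from conditioning on the largest p-value: given the maximum, the remaining N - 1 values are uniform below it and each must fall in the top fraction 1 - 1/R of that range.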
6. Strength of common annotation: distances between sites
- Account for no overlaps between sites
  - renormalization of $p_i$ for each site
  - $p_0 = 1 - \sum_{i=1}^{S} p_i$: positions between sites
- Compute (approximately) the probability of the common annotation, $P_{CA}$, as a function of $p_0, p_1, \ldots, p_S$
- Strength of common annotation
  - $Z = -\ln P_{CA}$
7. Searching for the strongest common annotations
- Given an input set of annotations, define groups of annotations such that
  - each group has at least one common annotation
  - the strongest common annotation of each group is distinct
- NB: Groups may fully or partially overlap! Cannot use standard clustering algorithms.
8. Classification Algorithm
- Find pairs of annotations with at least one common annotation
  - Each pair is a nucleus of a potential group
- Each group grows by adding annotations one at a time
  - the group retains its strongest common annotation at each step
  - each addition maximizes the group strength
  - an annotation added to one group remains available for addition to other groups
- Where does the growth stop?
(strength of the strongest common annotation = group strength, $Z_g$)
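The growth procedure above can be sketched on a toy model. The simplifications here are assumptions made for illustration, not the method from the talk: each annotation is reduced to a set of site identities, the "common annotation" of a group is the set of sites shared by all members (at least 2), and the group strength is a caller-supplied function standing in for $Z_g$. The names `common_annotation`, `grow_groups` and `weights` are hypothetical.

```python
from itertools import combinations

def common_annotation(annotations, members):
    """Toy common annotation: site identities shared by every member.
    Returns None when fewer than 2 sites are shared."""
    shared = frozenset.intersection(*(annotations[m] for m in members))
    return shared if len(shared) >= 2 else None

def grow_groups(annotations, strength):
    """Greedy growth from every pair (nucleus); groups may overlap.
    `strength(members)` is a stand-in for the group strength Z_g."""
    groups = {}  # common annotation -> member indices in order of addition
    for a, b in combinations(range(len(annotations)), 2):
        core = common_annotation(annotations, [a, b])
        if core is None or core in groups:
            continue  # no nucleus, or this group was already grown (pruning)
        members = [a, b]
        while True:
            # Candidates must leave the group's common annotation unchanged.
            candidates = [k for k in range(len(annotations))
                          if k not in members
                          and common_annotation(annotations, members + [k]) == core]
            if not candidates:
                break  # simplest stopping rule: nothing left to add
            best = max(candidates, key=lambda k: strength(members + [k]))
            members.append(best)
        groups[core] = members
    return groups

annotations = [frozenset("ABC"), frozenset("ABC"), frozenset("ABD"), frozenset("DE")]
weights = [1.0, 0.9, 0.5, 0.2]  # made-up per-annotation quality
groups = grow_groups(annotations, lambda members: sum(weights[m] for m in members))
```

In this toy input, annotations 0 and 1 form a group around sites A, B, C, while annotations 0, 1 and 2 form a second group around A, B. The two groups overlap, which is the situation that rules out standard clustering.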
9. Stopping criterion
- No more annotations can be added
  - group contains all annotations in the input set
  - change in the strongest common annotation
- Formed during growth of another group
  - ignore current group (pruning)
- Group strength is too small
  - adding an unrelated annotation
  - group strength $Z_g$ is a score ($Z_g > 0$)
  - can be computed for groups of random annotations
  - by the extremal types theorem
    - $\lim_{n \to \infty} \mathrm{Prob}(Z_g^{\mathrm{rand}} > Z_g) = 1 - \exp\left[-(Z_g/b)^{-a}\right]$
  - threshold on $Z_g$
  - numerical calibration of $a$, $b$ for all possible N, S
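The limiting tail probability $1 - \exp[-(Z_g/b)^{-a}]$ can be inverted to turn a significance level into a threshold on $Z_g$. The sketch below assumes the parameters $a$ and $b$ have already been calibrated for the relevant N and S. The values used in the example are made up, and the function names are hypothetical.

```python
import math

def tail_prob(z_g, a, b):
    """Limiting Prob(Z_rand > z_g) = 1 - exp(-(z_g / b) ** (-a))."""
    if z_g <= 0 or a <= 0 or b <= 0:
        raise ValueError("z_g, a and b must be positive")
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for small x
    return -math.expm1(-((z_g / b) ** (-a)))

def strength_threshold(alpha, a, b):
    """Z_g at which the tail probability equals alpha (0 < alpha < 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Solve 1 - exp(-(z/b)^(-a)) = alpha for z
    return b * (-math.log1p(-alpha)) ** (-1.0 / a)

a, b = 3.0, 2.0  # illustrative values only, not calibrated
z_cut = strength_threshold(0.01, a, b)
p_at_cut = tail_prob(z_cut, a, b)
```

Because the tail probability decreases as $Z_g$ grows, groups with $Z_g$ above `z_cut` are less likely than `alpha` to arise from random annotations under this limiting law.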
10. From annotation groups to RMs
- Need a way to
  - account for optional sites
  - search for homologous RMs
11. RMs as generalized HMMs
- Generalized (duration) HMMs (gHMMs or dHMMs) consist of 2 types of states
  - motif states (PSSMs)
    - annotation sites
  - spacer states (distance distributions)
    - gaps between sites
- States are connected according to a certain topology
- Transition probabilities depend only on the present state
- Common annotations of groups are simple gHMMs
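A common annotation viewed as a simple gHMM can be sketched as a linear chain of motif and spacer states. This is an illustrative data structure under stated assumptions, not the model from the talk: all transitions have probability 1, spacer states score only the gap length (background base emissions are ignored), and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class MotifState:
    name: str
    pssm: list  # one {base: probability} dict per motif column

    def log_prob(self, site):
        if len(site) != len(self.pssm):
            raise ValueError("site length must match motif width")
        return sum(math.log(col[base]) for col, base in zip(self.pssm, site))

@dataclass
class SpacerState:
    durations: dict  # gap length in bp -> probability

    def log_prob(self, gap):
        p = self.durations.get(gap, 0.0)
        return math.log(p) if p > 0 else float("-inf")

def chain_log_prob(seq, starts, motifs, spacers):
    """Log-probability of motif sites at `starts` under the linear chain
    motifs[0] -> spacers[0] -> motifs[1] -> ..."""
    total = 0.0
    for i, (motif, start) in enumerate(zip(motifs, starts)):
        width = len(motif.pssm)
        total += motif.log_prob(seq[start:start + width])
        if i + 1 < len(motifs):
            gap = starts[i + 1] - (start + width)  # bases between the two sites
            total += spacers[i].log_prob(gap)
    return total

def column(consensus):
    """Made-up PSSM column: 0.7 for the consensus base, 0.1 for the others."""
    return {base: (0.7 if base == consensus else 0.1) for base in "ACGT"}

motif_a = MotifState("A", [column("A"), column("C")])
motif_b = MotifState("B", [column("T"), column("T")])
spacer = SpacerState({3: 0.5, 4: 0.5})
score = chain_log_prob("ACGGGTT", [0, 5], [motif_a, motif_b], [spacer])
```

A gap length outside the spacer's duration distribution gives a log-probability of negative infinity, which corresponds to a short-range distance distribution rejecting distant site pairs.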
12. RMs as generalized HMMs
Can make a single model because of the overlap!
- Common annotations define gHMM states
- Overlaps define topology and provide estimates of transition probabilities
- Multiple matches to the model
13. From annotations to RMs
RMs
14. Testing the Method: Test 1
- 25 random DNA sequences, 20 are seeded with an RM
  - 2 sites with low p-value ($< 10^{-3}$) separated by 20-25 bp
- Scan sequences with an unrelated motif subject to a p-value threshold
  - 3rd site (random noise in annotations)
[Figure labels: m0, m1]
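A data set of the kind described for Test 1 could be generated roughly as follows. This is a sketch under assumptions: the two site strings are arbitrary placeholders, the sequence length of 500 bp is taken from the 0.5 kb in the problem setup, "separated by 20-25 bp" is read as the gap between the end of the first site and the start of the second, and the scan for the third (noise) site is not included.

```python
import random

def seed_test_set(n_seqs=25, n_seeded=20, length=500,
                  site_a="TTGACAGC", site_b="TATAATGG",
                  gap_range=(20, 25), seed=0):
    """Return (sequence, placement) pairs; placement is None for unseeded
    sequences, else (start of site A, start of site B)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_seqs):
        seq = [rng.choice("ACGT") for _ in range(length)]  # random background
        placed = None
        if i < n_seeded:
            gap = rng.randint(*gap_range)  # inclusive on both ends
            span = len(site_a) + gap + len(site_b)
            a_start = rng.randrange(length - span + 1)
            b_start = a_start + len(site_a) + gap
            # Overwrite the background with the two sites (length unchanged)
            seq[a_start:a_start + len(site_a)] = site_a
            seq[b_start:b_start + len(site_b)] = site_b
            placed = (a_start, b_start)
        records.append(("".join(seq), placed))
    return records

records = seed_test_set()
```

Tests 2 and 3 differ only in how sequences are assigned to modules, so the same routine could be called once per group with a different gap range.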
15. Testing the Method: Test 2
- 25 random DNA sequences, 2 non-overlapping groups of 10 and 11 sequences
  - each group is seeded with a distinct RM (2 sites)
  - distance between sites is 20-25 bp or 52-55 bp
- Extra site added as before
16. Testing the Method: Test 3
- 25 random DNA sequences, 2 overlapping groups of 12 and 14 sequences
  - same RMs as in previous test
  - groups overlap by 5 sequences
- Extra site added as before
17. Summary
- A method for discovery of regulatory modules given a set of annotated sequences
  - Builds RMs from recurrent annotation patterns
  - Treats site p-values and distances in a consistent statistical framework
  - Can use prior information on RMs (Bayesian approach)
- RMs are output as gHMMs
  - flexibility of RM structure (topology)
  - searching for homologous RMs
18. Future developments
- Testing the method on real data
  - upstream regions of bacterial operons
  - bacterial Fe-regulons
  - other benchmark sets?
- Algorithm improvements
  - better stopping criterion (use properties of distance distributions)
  - more precise computation of common annotation strength
  - better similarity measure for site p-values (reduce compensation)
19. Acknowledgements
- Thanks to David Ardell (LCB, Uppsala) and
Georgiy Sofronov (Univ. of Queensland, Brisbane)
for many fruitful discussions