Title: Peter J' Bickel
1Refined Non Parametric Methods for Genomic
inference
- Peter J. Bickel
- Department of Statistics
- University of California at Berkeley, USA
Joint work with Nancy R. Zhang (Stanford), James
B. Brown (UCB) and Haiyan Huang (UCB)
2Motivating Questions
3Association of functional annotations in Human
Genome
? Transcription Start Sites (TSSs)
5'
3'
3'
5'
- The ENCODE Consortium found that many
Transcription Start Sites are anti-sense to
GENCODE exons - They also found vastly more TSSs than previously
supposed - Is the association between TSSs and exons in the
anti-sense direction real, or experimental noise
in TSS identification?
4Association of experimental annotations across
whole chromosomes
Do two factors tend to bind together more closely
or more often than other pairs of factors? Does
a factors binding site relative to TSSs tend to
change across genomic regions?
5The statistical relation of Transcription Start
Sites and protein binding sites
Figure from ENCODE Consortium Paper Nature, June
14th, 2007
Normalized signal intensity
Enchancer activity?
- Normalized Chip-chIP signals around GENCODE TSSs
in ENCODE regions - Most peak over the TSS and are clearly
significant - Does the upstream bump in CTCF constitute good
evidence of enchancer binding activity?
6What is a non-parametric model for the Genome and
why is it needed?
7Feature Overlap the question
- A mathematical question arises
- Do these features overlap more, or less than
expected at random?
5'
3'
8Our formulation
- Defining expectation and at random
- The genome is highly structured
- Analysis of feature inter-dependence must account
for superficial structure - Expected at random becomes
- Overlap between two feature sets bearing
structure, under no biological constraints
9Naïve Method
- Treating bases as being independent with same
distribution (ordinary bootstrap) - Hypothesis Feature markings are independent
- Specific Object Test based on
- Feature Overlap ( Feature1)( Feature2)
- and standard statistics
- Why naïve ? Bases are NOT independent
- Better method keeping one type of feature fixed
and simulating moving start site of another
feature uniformly (feature bootstrap) - Why still a problem?
- Even if feature occurrences are independent
functionally, there can be clumping caused by the
complex underlying genome sequence structure - (i.e. inhomogeneity, local sequence
dependence)
10A non parametric model
- Requirements
- It should roughly reflect known statistics of the
genome - It should encompass methods listed
- It should be possible to do inference, tests, set
confidence bounds meaningfully
11Segmented Stationary Model
- Let Xi base at position i, i1,,n
- such that for each k1,,r,
is - Stationary (homogeneity within blocks)
- Mixing (bases at distant positions are nearly
independent) - r ltlt n
12Empirical Interpretations
- Within a segment
- For k small compared to minimum segment length,
statistics of random kmers do not differ between
large subsegments of segment - Knowledge of the first kmer does not help in
predicting a distant kmer - Remark
- If this model holds it also applies to derived
local features, e.g. I1,,In where Ik 1 if
position k belongs to binding site for given
factor
13Mentioned other models are special cases for r 1
- Independent identically distributed (bootstrap)
- Stationary Markov
- Uniform displacement of start sites (Homogeneous
Poisson Process)
14Is the Effect Serious?
Example Statistic Overlap between two features
in a binary sequence of 10K bases
(region statistic in the
ENCODE studies) Feature 1 occurrence of
motif 111000 Feature 2 more than six 1s in 10
consecutive bases
- Ordinary bootstrap
- Base-by-base sampling randomly from observed
sequence for two features separately - Feature randomization
- Keep one type of feature fixed and randomizing
the start positions of the other
15Evidence for Segmented Stationarity
- DNA sequence is known to be inhomogeneous
- However, it has been segmented into homogeneous
domains based on - Base composition (e.g. finding Isochores)
- CpG density
- Density of higher order features (e.g. ORFS,
palindromes, TFBS) - Our model aims to capture these domain-specific
effects, while avoiding parametric assumptions
within domains
Figure from Li, 2001
References Elton (1974, J. Theoretical Bio.),
Braun and Müller (1998, Statistical Science), Li
et al. (1998, Genome Res.), Liu and Lawrence
(1999, Bioinformatics)
16Inference with our model
- Use X1,,Xn for basic data, but Xk could be base
identity, feature identity, a vector of feature
identities obeying segmented stationarity
assumption.
17Using our model for inference
Many genomic statistics are function of one or
more sums of the form e.g.
is 1 or 0 depending on the presence or absence of
a feature or features
- When the summands are small compared to S
- Gaussian case
- Example Region overlap for common
features, or rare features over large regions
Under segmented stationarity, these distributions
can be estimated from the data
18Distributions of feature overlaps
- The Block Bootstrap
- Cant observe independent occurrences of ENCODE
regions, but if our hypothesis of segmented
stationarity holds then the distribution of sum
statistics and their functions can be
approximated as follows
19Block Bootstrap for r 1
- Algorithm 4.1
- Given L ltlt n choose a number N uniformly at
random from - Given the statistics Tn(X1,,Xn) , under the
assumption that X1,,Xn is stationary, compute - Repeat B times to obtain
- Estimate the distribution of
by the empirical distribution - By Theorem 4.2.1 of Politis, Romano and Wolf
(1999)
20Block Bootstrap Animationr 1
Statistic
Observed Sequence (X)
Sf(X)
Draw a block of length L from original sequence,
this is the block-bootstrapped sequence.
Calculate statistic on the block bootstrapped
sequence.
Repeat this procedure identically B times.
21Observing the distributions
Block bootstrap distribution of the Region
Overlap Statistic
Shown here with the PDF of the normal
distribution with the same mean and variance
The histogram of
Is approximately the same as density of
QQplot of BB distribution vs. standard normal
22What if r gt 1
- The estimated distribution is always heavier
tailed leading to conservative p values - But it can be enormously so if the segment means
of the statistic differ substantially - Less so but still meaningful if the means agree
but variances differ
23Simulation Study
- For simplicity, we concatenate 2 homogeneous
regions generated as above
24Simulation Results and comparison to a naïve
method
25Solutions
- Segment using biological knowledge
- Essentially done in ENCODE poor segmentation
occasionally led to non-Gaussian distributions
(excessively conservative) - Segment using a particular linear statistic which
we expect to identify homogeneous segments
26Block Bootstrap with Segmentation
- Draw a block from each sub-segment and
concatenate to form a block bootstrap sample
27Block Bootstrap given Segmentation
f3L
f1L
f2L
1. Draw Subsample of length L
2. Compute statistic on subsample
T(X)
28Simulation Results, with segmentation
29Dyadic Segmentation
- Define,
- Find jmax maximizing M(j) creating intervals
Ileft and Iright - If length of both intervals falls below a
stopping criterion, stop - Else, repeat process for Ileft and/or Iright,
whichever are longer than stopping criterion,
with redefined M(j)
30Dyadic Segmentation
31(No Transcript)
32Confidence Bounds r gt 1
- Given a statistic, e.g. basepair overlap
Find
such that
as small as possible
Average basepair overlap over all potential
genomes for the region considered
33Use Algorithm 4.1
- For each segment pick random block of length
proportional to segment length - Concatenate to get block of length L
- Compute bp overlap for block
- Repeat many times
- Use 100(1-a) percentiles of this for
34Testing Association
- Question How do we estimate null distribution
given only data for which we believe the null is
false?
35Testing Association (bp overlap)
Observed Sequence (Feature 1 ,
Feature 2 )
Statistic is (X2)(Y1)(X1)(Y2), properly
normalized and set to mean 0. Under the null
hypothesis of independence, this should be
Gaussian.
Align Feature 1 of first block with Feature 2 of
second block, And vice versa.
Calculate overlap in the blocks after swapping
(X2)(Y1)(X1)(Y2)
Sample two blocks of equal length.
36Test Statistic
- H Features not associated in each segment
(so-called dummy overlap) - Then has a Gaussian
distribution. -
- We form the test statistic
-
-
where
Length of segment i/n of basepairs in segment i
identified as Feature 1 of basepairs in segment
i identified as Feature 2
37Null Distribution
- Choose pairs of blocks at random
- Compute false (dummy) overlap H
- Compute I Feature 1 and J Feature 2
- Block bootstrapped Null H IJ
- If r gt 1, pairs of blocks are chosen in each
region, H and IJ are weighted sums across
regions. - The Null is mean zero, and has the correct
variance
38Example from ENCODE data
- ENm001 ENCODE Consortium annotated over 2500
feature-instances exclusive of UTRs and CDSs - Question Do these (largely) non-coding features
exhibit more overlap with constrained sequences
than expected at random? - To answer, we used the block bootstrap to obtain
null distribution - When null is Gaussian, it has the correct
variance - When not, it is overly conservative
- Segmentation can reduce conservativeness, and
detect significance that would otherwise be missed
39(No Transcript)
40There are two Ls
- Ls the minimum segment length during
segmentation - To be discussed
- L the length of blocks during subsamling
- Chosen on grounds of stability
41A philosophical questionThe Issue of Scale
- Relevant probability assessments depend on
segmentation - Segmentation depends on scale
- Things which seem surprising on small scales, may
not be at larger ones - E.g. differences in GC content
My view Its only determinable biologically
42Some Future Directions
- KS type tests
- Beyond overlap, KS-type tests can compare the
distributions of features, e.g. Does the pattern
of constrained sequence in coding regions differ
from that in non-coding regions? - Maxima
- Aggregative plots can summarize one feature in
the neighborhood of another, e.g. Does binding
data (such as Chip-chIP) show that a given
regulatory factor tends to bind near TSSs? - Other types of association
- Does wavelet analysis offer significant support
for the large scale association of replication
timing and conservation? - Many others arising from ENCODE, modENCODE, and
elsewhere - Other types of segmentation
- Dyadic segmentation is analytically convenient,
but other segmentations may be useful
43Acknowledgements
- The ENCODE Consortium
- The MSA and Transcription and Regulation Groups
- Especially Elliot Margulies, Tom Gingeras and
Ewan Birney - Supported by NIGMS and NHGRI
44(No Transcript)
45Association of functional annotations in Human
Genome
Table from ENCODE Consortium Paper Nature, June
14th, 2007
46Dyadic Segmentation
Algorithm 4.8
Algorithm 4.8
For a minimum region length Ls and threshold b
initialize
- For i 1,,t-1, let M(i)(j) and V(i)(j) be
respectively the processes (4.7) and (4.8)
computed on the subsequence Xti-11,,Xti. Let
ti argmaxjM(i)(k), and mi min(ti ti-1,ti
- ti). Let - Let Vi V(i)(ti). Let
If
stop, return t. - Let i argmaxi Bi , and tnew ti
- Let t t ? tnew reordered so that ti is
monotonically increasing in i.
47(No Transcript)