Peter J. Bickel - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Peter J. Bickel

1
Refined Non-Parametric Methods for Genomic
Inference

Peter J. Bickel
Department of Statistics
University of California at Berkeley, USA

Joint work with Nancy R. Zhang (Stanford), James
B. Brown (UCB) and Haiyan Huang (UCB)
2
Motivating Questions
3
Association of functional annotations in Human
Genome
Transcription Start Sites (TSSs)
[Figure: the two DNA strands, labeled 5' to 3' and 3' to 5']

The ENCODE Consortium found that many
Transcription Start Sites are anti-sense to
GENCODE exons
They also found vastly more TSSs than previously
supposed
Is the association between TSSs and exons in the
anti-sense direction real, or experimental noise
in TSS identification?

4
Association of experimental annotations across
whole chromosomes
Do two factors tend to bind together more closely
or more often than other pairs of factors? Does
a factor's binding site relative to TSSs tend to
change across genomic regions?
5
The statistical relation of Transcription Start
Sites and protein binding sites
Figure from the ENCODE Consortium paper, Nature, June 14, 2007
Normalized signal intensity
Enhancer activity?

Normalized ChIP-chip signals around GENCODE TSSs
in ENCODE regions
Most peak over the TSS and are clearly
significant
Does the upstream bump in CTCF constitute good
evidence of enhancer binding activity?

6
What is a non-parametric model for the Genome and
why is it needed?
7
Feature Overlap: the question

A mathematical question arises:
Do these features overlap more, or less, than
expected at random?

[Figure: strand labeled 5' to 3']
8
Our formulation

Defining "expected" and "at random"
The genome is highly structured
Analysis of feature inter-dependence must account
for superficial structure
"Expected at random" becomes:
Overlap between two feature sets bearing
structure, under no biological constraints

9
Naïve Method

Treating bases as independent with the same distribution (ordinary bootstrap)
Hypothesis: feature markings are independent
Specific object: test based on feature overlap versus (Feature 1)(Feature 2), and standard statistics
Why naïve? Bases are NOT independent
Better method: keep one type of feature fixed and simulate moving the start sites of the other feature uniformly (feature bootstrap)
Why still a problem?
Even if feature occurrences are independent functionally, there can be clumping caused by the complex underlying genome sequence structure (i.e. inhomogeneity, local sequence dependence)

10
A non-parametric model

Requirements
It should roughly reflect known statistics of the
genome
It should encompass methods listed
It should be possible to do inference, tests, set
confidence bounds meaningfully

11
Segmented Stationary Model

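Let $X_i$ = base at position $i$, $i = 1, \dots, n$.
There are segment boundaries $0 = t_0 < t_1 < \dots < t_r = n$
such that for each $k = 1, \dots, r$, the segment $X_{t_{k-1}+1}, \dots, X_{t_k}$ is
Stationary (homogeneity within blocks)
Mixing (bases at distant positions are nearly independent)
$r \ll n$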

12
Empirical Interpretations

Within a segment:
For k small compared to the minimum segment length, statistics of random k-mers do not differ between large subsegments of the segment
Knowledge of the first k-mer does not help in predicting a distant k-mer
Remark:
If this model holds, it also applies to derived local features, e.g. $I_1, \dots, I_n$ where $I_k = 1$ if position $k$ belongs to a binding site for a given factor

13
The other models mentioned are special cases with r = 1

Independent identically distributed (bootstrap)
Stationary Markov
Uniform displacement of start sites (Homogeneous
Poisson Process)

14
Is the Effect Serious?
Example statistic: overlap between two features in a binary sequence of 10K bases (region statistic in the ENCODE studies)
Feature 1: occurrence of motif 111000
Feature 2: more than six 1s in 10 consecutive bases

Ordinary bootstrap:
Base-by-base sampling randomly from the observed sequence, for the two features separately
Feature randomization:
Keep one type of feature fixed and randomize the start positions of the other
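
As a rough illustration of the example on this slide, the sketch below marks the two features on a binary sequence, counts their base-pair overlap, and draws ordinary-bootstrap replicates. It is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, a feature is taken to cover every base of a matching window, and the ordinary bootstrap is read as resampling each feature track base by base.

```python
import random

def motif_mask(seq, motif="111000"):
    """Feature 1: mark every base covered by an occurrence of the motif."""
    mask = [0] * len(seq)
    text = "".join(map(str, seq))
    start = text.find(motif)
    while start != -1:
        for i in range(start, start + len(motif)):
            mask[i] = 1
        start = text.find(motif, start + 1)
    return mask

def dense_mask(seq, window=10, min_ones=7):
    """Feature 2: mark every base in a 10-base window with more than six 1s."""
    mask = [0] * len(seq)
    for start in range(len(seq) - window + 1):
        if sum(seq[start:start + window]) >= min_ones:
            for i in range(start, start + window):
                mask[i] = 1
    return mask

def overlap(mask1, mask2):
    """Number of positions covered by both features."""
    return sum(a & b for a, b in zip(mask1, mask2))

def ordinary_bootstrap_overlaps(mask1, mask2, reps, rng):
    """Assumed reading of the ordinary bootstrap: resample each feature
    track base by base, independently, and recompute the overlap."""
    n = len(mask1)
    values = []
    for _ in range(reps):
        b1 = [mask1[rng.randrange(n)] for _ in range(n)]
        b2 = [mask2[rng.randrange(n)] for _ in range(n)]
        values.append(overlap(b1, b2))
    return values
```

Because base-by-base resampling destroys the runs that both features are built from, the replicate overlaps ignore exactly the local dependence that the slides warn about.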

15
Evidence for Segmented Stationarity

DNA sequence is known to be inhomogeneous
However, it has been segmented into homogeneous
domains based on
Base composition (e.g. finding Isochores)
CpG density
Density of higher order features (e.g. ORFs,
palindromes, TFBS)
Our model aims to capture these domain-specific
effects, while avoiding parametric assumptions
within domains

Figure from Li, 2001
References: Elton (1974, J. Theoretical Bio.),
Braun and Müller (1998, Statistical Science), Li
et al. (1998, Genome Res.), Liu and Lawrence
(1999, Bioinformatics)
16
Inference with our model

Use $X_1, \dots, X_n$ for the basic data, where $X_k$ could be a base identity, a feature identity, or a vector of feature identities, obeying the segmented stationarity assumption.

17
Using our model for inference
Many genomic statistics are functions of one or more sums of the form, e.g., $S = \sum_{k=1}^{n} I_k$,
where $I_k$ is 1 or 0 depending on the presence or absence of a feature or features at position $k$

When the summands are small compared to $S$: Gaussian case
Example: region overlap for common features, or rare features over large regions

Under segmented stationarity, these distributions
can be estimated from the data
18
Distributions of feature overlaps

The Block Bootstrap
We can't observe independent occurrences of ENCODE
regions, but if our hypothesis of segmented
stationarity holds, then the distribution of sum
statistics and their functions can be
approximated as follows:

19
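Block Bootstrap for r = 1

Algorithm 4.1
Given $L \ll n$, choose a number $N$ uniformly at random from $\{1, \dots, n - L + 1\}$
Given the statistic $T_n(X_1, \dots, X_n)$, under the assumption that $X_1, \dots, X_n$ is stationary, compute the statistic on the block $X_N, \dots, X_{N+L-1}$
Repeat B times to obtain B block values of the statistic
Estimate the distribution of the statistic (suitably centered and scaled) by the empirical distribution of the B block values
Justified by Theorem 4.2.1 of Politis, Romano and Wolf (1999)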
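
The steps of Algorithm 4.1 can be sketched in a few lines. This is a minimal sketch assuming a single stationary segment (r = 1); the name `block_bootstrap` and its parameters are hypothetical, and the centering and scaling of the statistic are left to the caller because the transcript does not preserve them.

```python
import random

def block_bootstrap(x, statistic, block_len, reps, rng=None):
    """Evaluate `statistic` on `reps` contiguous blocks of length `block_len`,
    each starting at a position drawn uniformly at random."""
    rng = rng or random.Random()
    n = len(x)
    if not 0 < block_len <= n:
        raise ValueError("block_len must be between 1 and len(x)")
    values = []
    for _ in range(reps):
        start = rng.randrange(n - block_len + 1)  # 0-based start position
        values.append(statistic(x[start:start + block_len]))
    return values
```

For example, `block_bootstrap(track, lambda b: sum(b) / len(b), 500, 1000)` would give 1000 block-level coverage fractions for a 0/1 feature track named `track` (a hypothetical variable).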

20
Block Bootstrap Animation: r = 1
Observed Sequence (X)
Statistic: S = f(X)

Draw a block of length L from the original sequence;
this is the block-bootstrapped sequence.
Calculate the statistic on the block-bootstrapped
sequence.
Repeat this procedure identically B times.
21
Observing the distributions
Block bootstrap distribution of the Region
Overlap Statistic
Shown here with the PDF of the normal
distribution with the same mean and variance
The histogram of the block bootstrap statistic is approximately the same as the density of that normal distribution
Q-Q plot of the block bootstrap (BB) distribution vs. standard normal
22
What if r > 1?

The estimated distribution is always heavier
tailed, leading to conservative p-values
But it can be enormously so if the segment means
of the statistic differ substantially
Less so but still meaningful if the means agree
but variances differ

23
Simulation Study

For simplicity, we concatenate 2 homogeneous
regions generated as above

24
Simulation Results and comparison to a naïve
method
25
Solutions

Segment using biological knowledge
Essentially done in ENCODE; poor segmentation
occasionally led to non-Gaussian distributions
(excessively conservative)
Segment using a particular linear statistic which
we expect to identify homogeneous segments

26
Block Bootstrap with Segmentation

Draw a block from each sub-segment and
concatenate to form a block bootstrap sample
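
One way to read this step is sketched below: given segment boundaries, draw one block per segment and concatenate the blocks. It is a minimal sketch, assuming each block's length is proportional to its segment's length (as the later "Use Algorithm 4.1" slide states); the function name and the half-open boundary convention are hypothetical.

```python
import random

def segmented_block_sample(x, boundaries, total_len, rng=None):
    """Draw one block per segment and concatenate them.

    `boundaries` lists segment end points as [0, t1, ..., len(x)], with
    segment k covering x[boundaries[k]:boundaries[k + 1]].
    """
    rng = rng or random.Random()
    n = len(x)
    sample = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        seg_len = hi - lo
        # Block length proportional to segment length, at least one base.
        blk = min(seg_len, max(1, round(total_len * seg_len / n)))
        start = lo + rng.randrange(seg_len - blk + 1)
        sample.extend(x[start:start + blk])
    return sample
```

Because of rounding, the concatenated sample can differ from `total_len` by a few bases when the proportions are not whole numbers.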

27
Block Bootstrap given Segmentation
[Figure: one block drawn per segment, with block lengths f1L, f2L, f3L]
1. Draw subsample of length L
2. Compute statistic T(X) on subsample
28
Simulation Results, with segmentation
29
Dyadic Segmentation

Define,
Find $j_{max}$ maximizing $M(j)$, creating intervals
$I_{left}$ and $I_{right}$
If the lengths of both intervals fall below a
stopping criterion, stop
Else, repeat the process for $I_{left}$ and/or $I_{right}$,
whichever are longer than the stopping criterion,
with redefined $M(j)$
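
The recursion can be sketched as follows. The transcript does not preserve the definition of M(j), so this sketch substitutes an assumed CUSUM-type contrast (the weighted absolute difference of the left and right means); the function name, the `min_len` stopping length, and the optional `threshold` are likewise hypothetical stand-ins.

```python
def dyadic_segmentation(x, min_len, threshold=0.0):
    """Recursively split x at the point maximizing a mean-difference contrast.

    Returns boundaries [0, ..., len(x)]. Intervals of length <= min_len are
    not split further, and a split is kept only if its contrast exceeds
    `threshold`.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)

    def contrast(lo, j, hi):
        # Assumed stand-in for M(j): weighted |mean(left) - mean(right)|.
        left = (prefix[j] - prefix[lo]) / (j - lo)
        right = (prefix[hi] - prefix[j]) / (hi - j)
        weight = ((j - lo) * (hi - j) / (hi - lo)) ** 0.5
        return weight * abs(left - right)

    def split(lo, hi):
        if hi - lo <= min_len:
            return []
        j_max = max(range(lo + 1, hi), key=lambda j: contrast(lo, j, hi))
        if contrast(lo, j_max, hi) <= threshold:
            return []
        return split(lo, j_max) + [j_max] + split(j_max, hi)

    return [0] + split(0, len(x)) + [len(x)]
```

With `threshold=0.0` the sketch keeps splitting any long interval that shows some contrast, so on noisy data a positive threshold would be needed to avoid over-segmentation.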

30
Dyadic Segmentation
31
(No Transcript)
32
Confidence Bounds: r > 1

Given a statistic, e.g. basepair overlap

Find
such that
as small as possible
Average basepair overlap over all potential
genomes for the region considered
33
Use Algorithm 4.1

For each segment pick random block of length
proportional to segment length
Concatenate to get block of length L
Compute bp overlap for block
Repeat many times
Use the 100(1 - α) percentiles of this for the confidence bounds

34
Testing Association

Question: How do we estimate the null distribution
given only data for which we believe the null is
false?

35
Testing Association (bp overlap)
Observed Sequence (Feature 1, Feature 2)
Sample two blocks of equal length.
Align Feature 1 of the first block with Feature 2 of the second block, and vice versa.
Calculate the overlap in the blocks after swapping: (X2)(Y1) + (X1)(Y2)
The statistic is (X2)(Y1) + (X1)(Y2), properly normalized and set to mean 0. Under the null hypothesis of independence, this should be Gaussian.
36
Test Statistic

H: Features not associated in each segment
Then the so-called dummy overlap has a Gaussian
distribution.
We form the test statistic (no transcript), where the per-segment quantities are:

Length of segment i / n
Basepairs in segment i identified as Feature 1
Basepairs in segment i identified as Feature 2
37
Null Distribution

Choose pairs of blocks at random
Compute the false (dummy) overlap H
Compute I = Feature 1 and J = Feature 2
Block bootstrapped Null: H - IJ
If r > 1, pairs of blocks are chosen in each
region; H and IJ are weighted sums across
regions.
The Null is mean zero, and has the correct
variance
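
A rough sketch of this null construction for a single region follows. It is one reading of the slide, not the authors' implementation: H is taken as the fraction of overlapping bases after swapping feature tracks between two randomly chosen blocks, I and J as the fractions of bases in Feature 1 and Feature 2 within those blocks, and each null draw as H minus the product IJ. All names are hypothetical.

```python
import random

def null_draws(f1, f2, block_len, reps, rng=None):
    """Block-pair null draws for two 0/1 feature tracks of equal length."""
    rng = rng or random.Random()
    n = len(f1)
    if len(f2) != n or not 0 < block_len <= n:
        raise ValueError("tracks must match in length and block_len must fit")
    draws = []
    for _ in range(reps):
        a = rng.randrange(n - block_len + 1)
        b = rng.randrange(n - block_len + 1)
        x1, y1 = f1[a:a + block_len], f2[a:a + block_len]
        x2, y2 = f1[b:b + block_len], f2[b:b + block_len]
        # Dummy overlap: Feature 1 of one block against Feature 2 of the other.
        swapped = (sum(u * v for u, v in zip(x1, y2))
                   + sum(u * v for u, v in zip(x2, y1)))
        h = swapped / (2 * block_len)
        i_frac = (sum(x1) + sum(x2)) / (2 * block_len)
        j_frac = (sum(y1) + sum(y2)) / (2 * block_len)
        draws.append(h - i_frac * j_frac)
    return draws
```

The sketch does not prevent the two blocks from overlapping each other, and for r > 1 the draws would have to be computed per region and combined as weighted sums, as the slide notes.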

38
Example from ENCODE data

ENm001: the ENCODE Consortium annotated over 2500
feature instances, exclusive of UTRs and CDSs
Question: Do these (largely) non-coding features
exhibit more overlap with constrained sequences
than expected at random?
To answer, we used the block bootstrap to obtain
the null distribution
When the null is Gaussian, it has the correct
variance
When not, it is overly conservative
Segmentation can reduce conservativeness, and
detect significance that would otherwise be missed

39
(No Transcript)
40
There are two L's

$L_s$: the minimum segment length during
segmentation
To be discussed
$L$: the length of blocks during subsampling
Chosen on grounds of stability

41
A philosophical question: The Issue of Scale

Relevant probability assessments depend on
segmentation
Segmentation depends on scale
Things which seem surprising on small scales, may
not be at larger ones
E.g. differences in GC content

My view: It's only determinable biologically
42
Some Future Directions

KS type tests
Beyond overlap, KS-type tests can compare the
distributions of features, e.g. Does the pattern
of constrained sequence in coding regions differ
from that in non-coding regions?
Maxima
Aggregative plots can summarize one feature in
the neighborhood of another, e.g. Does binding
data (such as ChIP-chip) show that a given
regulatory factor tends to bind near TSSs?
Other types of association
Does wavelet analysis offer significant support
for the large scale association of replication
timing and conservation?
Many others arising from ENCODE, modENCODE, and
elsewhere
Other types of segmentation
Dyadic segmentation is analytically convenient,
but other segmentations may be useful

43
Acknowledgements

The ENCODE Consortium
The MSA and Transcription and Regulation Groups
Especially Elliot Margulies, Tom Gingeras and
Ewan Birney
Supported by NIGMS and NHGRI

44
(No Transcript)
45
Association of functional annotations in Human
Genome
Table from the ENCODE Consortium paper, Nature, June 14, 2007
46
Dyadic Segmentation
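Algorithm 4.8
For a minimum region length $L_s$ and threshold $b$, initialize (no transcript)

For $i = 1, \dots, t-1$, let $M^{(i)}(j)$ and $V^{(i)}(j)$ be respectively the processes (4.7) and (4.8) computed on the subsequence $X_{t_{i-1}+1}, \dots, X_{t_i}$. Let $\hat{t}_i = \arg\max_j M^{(i)}(j)$, and $m_i = \min(\hat{t}_i - t_{i-1}, t_i - \hat{t}_i)$. Let (no transcript)
Let $V_i = V^{(i)}(\hat{t}_i)$. Let (no transcript)
If (no transcript), stop and return $t$.
Let $i^* = \arg\max_i B_i$, and $t_{new} = \hat{t}_{i^*}$.
Let $t = t \cup \{t_{new}\}$, reordered so that $t_i$ is monotonically increasing in $i$.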

47
(No Transcript)
