Peter J. Bickel - PowerPoint PPT Presentation

1
Refined Non-Parametric Methods for Genomic
Inference
  • Peter J. Bickel
  • Department of Statistics
  • University of California at Berkeley, USA

Joint work with Nancy R. Zhang (Stanford), James
B. Brown (UCB) and Haiyan Huang (UCB)
2
Motivating Questions
3
Association of functional annotations in Human
Genome
[Figure: Transcription Start Sites (TSSs) on the two strands, labeled 5' to 3' and 3' to 5']
  • The ENCODE Consortium found that many
    Transcription Start Sites are anti-sense to
    GENCODE exons
  • They also found vastly more TSSs than previously
    supposed
  • Is the association between TSSs and exons in the
    anti-sense direction real, or experimental noise
    in TSS identification?

4
Association of experimental annotations across
whole chromosomes
Do two factors tend to bind together more closely
or more often than other pairs of factors? Does
a factor's binding site relative to TSSs tend to
change across genomic regions?
5
The statistical relation of Transcription Start
Sites and protein binding sites
Figure from the ENCODE Consortium paper, Nature, June
14, 2007
[Figure axis: normalized signal intensity; annotation: enhancer activity?]
  • Normalized ChIP-chip signals around GENCODE TSSs
    in ENCODE regions
  • Most peak over the TSS and are clearly
    significant
  • Does the upstream bump in CTCF constitute good
    evidence of enhancer binding activity?

6
What is a non-parametric model for the Genome and
why is it needed?
7
Feature Overlap: the question
  • A mathematical question arises:
  • Do these features overlap more, or less, than
    expected at random?

8
Our formulation
  • Defining "expected" and "at random"
  • The genome is highly structured
  • Analysis of feature inter-dependence must account
    for superficial structure
  • "Expected at random" becomes:
  • Overlap between two feature sets bearing
    structure, under no biological constraints

9
Naïve Method
  • Treat bases as independent with the same
    distribution (ordinary bootstrap)
  • Hypothesis: feature markings are independent
  • Specific object: test based on
  • (fraction of bases in the overlap) - (fraction in Feature 1)(fraction in Feature 2)
  • and standard statistics
  • Why naïve? Bases are NOT independent
  • Better method: keep one type of feature fixed
    and simulate moving the start sites of the other
    feature uniformly (feature bootstrap)
  • Why still a problem?
  • Even if feature occurrences are independent
    functionally, there can be clumping caused by the
    complex underlying genome sequence structure
  • (i.e. inhomogeneity, local sequence
    dependence)
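
As a rough illustration of the naïve comparison on this slide, the sketch below computes the observed overlap fraction minus the product of the two marginal fractions. It assumes features are stored as equal-length 0/1 lists and that the naïve test is built on this difference; the function name `overlap_excess` is hypothetical and not from the talk.

```python
def overlap_excess(feature1, feature2):
    """Observed overlap fraction minus the product of the marginal fractions.

    Assumption (not from the slides): each feature is an equal-length 0/1
    list, 1 meaning the base belongs to the feature. If bases were
    independent, this difference would be close to zero.
    """
    n = len(feature1)
    if n == 0 or n != len(feature2):
        raise ValueError("feature tracks must be non-empty and of equal length")
    overlap = sum(a * b for a, b in zip(feature1, feature2)) / n
    frac1 = sum(feature1) / n
    frac2 = sum(feature2) / n
    return overlap - frac1 * frac2


# Two clumped features occupying the same half of a toy region.
f1 = [1] * 50 + [0] * 50
f2 = [1] * 50 + [0] * 50
print(overlap_excess(f1, f2))  # 0.5 - 0.5 * 0.5 = 0.25
```

The large excess here comes purely from clumping, which is why the slide argues that a base-by-base null is too optimistic.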

10
A non-parametric model
  • Requirements
  • It should roughly reflect known statistics of the
    genome
  • It should encompass methods listed
  • It should be possible to do inference, tests, set
    confidence bounds meaningfully

11
Segmented Stationary Model
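  • Let $X_i$ = base at position $i$, $i = 1, \dots, n$
  • such that for each $k = 1, \dots, r$, the segment
    $X_{t_{k-1}+1}, \dots, X_{t_k}$ (with boundaries $t_0 = 0$ and $t_r = n$) is
  • Stationary (homogeneity within blocks)
  • Mixing (bases at distant positions are nearly
    independent)
  • $r \ll n$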
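
To make the model concrete, here is a small simulation sketch of a segmented stationary sequence. It assumes a binary alphabet and a symmetric two-state Markov chain inside each segment; both are illustrative choices, and `simulate_segmented_chain` and its parameters are hypothetical names, not part of the talk.

```python
import random


def simulate_segmented_chain(segment_lengths, stay_probs, seed=0):
    """Concatenate binary Markov chains, one per segment.

    Within segment k the chain repeats its previous symbol with
    probability stay_probs[k] (assumed strictly between 0 and 1), so each
    segment is homogeneous and dependence decays with distance, while
    different segments can have different dependence strength.
    Returns the sequence and the boundaries [0, t_1, ..., n].
    """
    rng = random.Random(seed)
    sequence = []
    boundaries = [0]
    state = rng.randint(0, 1)
    for length, stay in zip(segment_lengths, stay_probs):
        for _ in range(length):
            if rng.random() >= stay:
                state = 1 - state  # switch symbol
            sequence.append(state)
        boundaries.append(len(sequence))
    return sequence, boundaries


seq, bounds = simulate_segmented_chain([500, 500], [0.95, 0.6])
print(bounds)  # [0, 500, 1000]
```

Here r = 2 segments and n = 1000, so the first half is strongly clumped and the second half is closer to independent.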


12
Empirical Interpretations
  • Within a segment
  • For $k$ small compared to the minimum segment length,
    statistics of random k-mers do not differ between
    large subsegments of the segment
  • Knowledge of the first k-mer does not help in
    predicting a distant k-mer
  • Remark
  • If this model holds, it also applies to derived
    local features, e.g. $I_1, \dots, I_n$ where $I_k = 1$ if
    position $k$ belongs to a binding site for a given
    factor

13
The other models mentioned are special cases with $r = 1$
  • Independent identically distributed (bootstrap)
  • Stationary Markov
  • Uniform displacement of start sites (Homogeneous
    Poisson Process)

14
Is the Effect Serious?
Example statistic: overlap between two features
in a binary sequence of 10K bases (the region statistic in the
ENCODE studies)
Feature 1: occurrence of the motif 111000
Feature 2: more than six 1s in 10 consecutive bases
  • Ordinary bootstrap
  • Base-by-base sampling randomly from the observed
    sequence for the two features separately
  • Feature randomization
  • Keep one type of feature fixed and randomize
    the start positions of the other
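
The two example features can be marked on a binary sequence roughly as follows. This is a sketch under one stated assumption: a feature covers every base of each matching motif occurrence or window (the slide does not say whether whole windows or only start positions are marked). All function names are hypothetical.

```python
def mark_motif(seq, motif=(1, 1, 1, 0, 0, 0)):
    """Return a 0/1 track marking every base covered by an occurrence of motif."""
    n, m = len(seq), len(motif)
    track = [0] * n
    for start in range(n - m + 1):
        if tuple(seq[start:start + m]) == tuple(motif):
            for pos in range(start, start + m):
                track[pos] = 1
    return track


def mark_dense_windows(seq, width=10, min_ones=7):
    """Mark every base lying in a window of `width` bases with >= min_ones 1s.

    "More than six 1s in 10 consecutive bases" corresponds to min_ones=7.
    """
    n = len(seq)
    track = [0] * n
    if n < width:
        return track
    ones = sum(seq[:width])
    for start in range(n - width + 1):
        if start > 0:
            # Slide the window one base to the right.
            ones += seq[start + width - 1] - seq[start - 1]
        if ones >= min_ones:
            for pos in range(start, start + width):
                track[pos] = 1
    return track


def overlap_count(track1, track2):
    """Number of bases that belong to both features."""
    return sum(a & b for a, b in zip(track1, track2))


toy = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(overlap_count(mark_motif(toy), mark_dense_windows(toy)))  # 6
```

In the toy sequence the motif occupies positions 4 to 9 and the single 10-base window holds seven 1s, so six bases are shared.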

15
Evidence for Segmented Stationarity
  • DNA sequence is known to be inhomogeneous
  • However, it has been segmented into homogeneous
    domains based on
  • Base composition (e.g. finding isochores)
  • CpG density
  • Density of higher-order features (e.g. ORFs,
    palindromes, TFBS)
  • Our model aims to capture these domain-specific
    effects, while avoiding parametric assumptions
    within domains

Figure from Li, 2001
References: Elton (1974, J. Theoretical Bio.),
Braun and Müller (1998, Statistical Science), Li
et al. (1998, Genome Res.), Liu and Lawrence
(1999, Bioinformatics)
16
Inference with our model
  • Use $X_1, \dots, X_n$ for the basic data, where $X_k$ could be a base
    identity, a feature identity, or a vector of feature
    identities obeying the segmented stationarity
    assumption.

17
Using our model for inference
Many genomic statistics are functions of one or
more sums $S$ whose summands are, e.g.,
1 or 0 depending on the presence or absence of
a feature or features
  • When the summands are small compared to $S$:
  • Gaussian case
  • Example: region overlap for common
    features, or rare features over large regions

Under segmented stationarity, these distributions
can be estimated from the data
18
Distributions of feature overlaps
  • The Block Bootstrap
  • We can't observe independent occurrences of ENCODE
    regions, but if our hypothesis of segmented
    stationarity holds, then the distribution of sum
    statistics and their functions can be
    approximated as follows

19
Block Bootstrap for $r = 1$
  • Algorithm 4.1
  • Given $L \ll n$, choose a number $N$ uniformly at
    random from
  • Given the statistic $T_n(X_1, \dots, X_n)$, under the
    assumption that $X_1, \dots, X_n$ is stationary, compute
  • Repeat $B$ times to obtain
  • Estimate the distribution of
    by the empirical distribution
  • By Theorem 4.2.1 of Politis, Romano and Wolf
    (1999)
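
The core loop of the block bootstrap for a single stationary segment might look like the sketch below. The formulas on this slide did not survive transcription, so the range of start positions (0 to n - L, zero-based) and the absence of any centering or scaling are assumptions; `block_bootstrap` and `mean` are hypothetical names.

```python
import random


def block_bootstrap(x, statistic, block_length, n_draws, seed=0):
    """Evaluate `statistic` on n_draws randomly placed contiguous blocks of x.

    Each block starts at a uniformly chosen position and has length
    block_length, so local dependence inside the block is preserved.
    The returned list is the raw empirical distribution of the block
    statistic (no centering or rescaling is applied here).
    """
    n = len(x)
    if block_length < 1 or block_length > n:
        raise ValueError("block_length must be between 1 and len(x)")
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        start = rng.randint(0, n - block_length)  # inclusive on both ends
        draws.append(statistic(x[start:start + block_length]))
    return draws


def mean(block):
    """Fraction of marked bases in a 0/1 block."""
    return sum(block) / len(block)


# A clumpy toy track: runs of twenty 0s followed by five 1s.
track = ([0] * 20 + [1] * 5) * 40
draws = block_bootstrap(track, mean, block_length=60, n_draws=500)
print(min(draws), max(draws))
```

Because whole blocks are kept intact, the spread of `draws` reflects the clumping in `track`, unlike base-by-base resampling.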

20
Block Bootstrap Animation: $r = 1$
Observed sequence ($X$)
Statistic: $S = f(X)$

Draw a block of length $L$ from the original sequence;
this is the block-bootstrapped sequence.
Calculate the statistic on the block-bootstrapped
sequence.
Repeat this procedure identically $B$ times.
21
Observing the distributions
[Figure: block bootstrap distribution of the Region
Overlap Statistic, shown with the PDF of the normal
distribution with the same mean and variance]
The histogram is approximately the same as the normal density.
[Figure: QQ plot of the BB distribution vs. standard normal]
22
What if $r > 1$?
  • The estimated distribution is always heavier
    tailed, leading to conservative p-values
  • But it can be enormously so if the segment means
    of the statistic differ substantially
  • Less so, but still meaningfully, if the means agree
    but the variances differ

23
Simulation Study
  • For simplicity, we concatenate 2 homogeneous
    regions generated as above

24
Simulation Results and comparison to a naïve
method
25
Solutions
  • Segment using biological knowledge
  • Essentially done in ENCODE; poor segmentation
    occasionally led to non-Gaussian distributions
    (excessively conservative)
  • Segment using a particular linear statistic which
    we expect to identify homogeneous segments

26
Block Bootstrap with Segmentation
  • Draw a block from each sub-segment and
    concatenate to form a block bootstrap sample
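
One way to draw such a sample is sketched below. It assumes the segment boundaries are already known, that every segment is non-empty, and that each segment contributes a block whose length is proportional to the segment length (as stated later in the talk); the rounding rule and the name `segmented_block_sample` are illustrative choices.

```python
import random


def segmented_block_sample(x, boundaries, total_length, rng):
    """Draw one contiguous block per segment and concatenate them.

    boundaries = [0, t_1, ..., n] delimits the segments. Segment k
    contributes a block whose length is its share of total_length
    (rounded, at least 1, and never longer than the segment itself).
    """
    n = boundaries[-1]
    sample = []
    for left, right in zip(boundaries[:-1], boundaries[1:]):
        seg_len = right - left
        block_len = max(1, min(seg_len, round(total_length * seg_len / n)))
        start = rng.randint(left, right - block_len)  # block stays inside segment
        sample.extend(x[start:start + block_len])
    return sample


rng = random.Random(0)
toy_track = [0] * 100 + [1] * 300
sample = segmented_block_sample(toy_track, [0, 100, 400], 40, rng)
print(len(sample))  # 40: ten bases from the first segment, thirty from the second
```

Since no block straddles a boundary, each piece of the sample comes from a single homogeneous segment.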

27
Block Bootstrap given Segmentation
[Figure: blocks of lengths $f_1 L$, $f_2 L$ and $f_3 L$ drawn from three segments]
1. Draw subsample of length $L$
2. Compute statistic $T(X)$ on subsample
28
Simulation Results, with segmentation
29
Dyadic Segmentation
  • Define,
  • Find $j_{\max}$ maximizing $M(j)$, creating intervals
    $I_{\text{left}}$ and $I_{\text{right}}$
  • If the length of both intervals falls below a
    stopping criterion, stop
  • Else, repeat the process for $I_{\text{left}}$ and/or $I_{\text{right}}$,
    whichever are longer than the stopping criterion,
    with redefined $M(j)$
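
The recursion can be sketched as follows. The definition of $M(j)$ is missing from the transcript, so this example substitutes a scaled CUSUM statistic as a stand-in; that choice, the threshold-based stopping rule, and the names `best_split` and `dyadic_segmentation` are all assumptions made for illustration.

```python
import math


def best_split(x, left, right, min_length):
    """Return (M, j): the largest scaled CUSUM value and its split point.

    Assumed stand-in for the slide's M(j):
    M(j) = |sum(x[left:j]) - (j - left) * mean(x[left:right])| / sqrt(right - left)
    Only splits leaving at least min_length positions on each side are considered.
    """
    n = right - left
    total = sum(x[left:right])
    best_m, best_j = -1.0, None
    running = 0.0
    for j in range(left + 1, right):
        running += x[j - 1]  # running == sum(x[left:j])
        if j - left < min_length or right - j < min_length:
            continue
        m = abs(running - (j - left) * total / n) / math.sqrt(n)
        if m > best_m:
            best_m, best_j = m, j
    return best_m, best_j


def dyadic_segmentation(x, min_length, threshold):
    """Repeatedly bisect intervals of x; return the sorted boundaries."""
    boundaries = [0, len(x)]
    stack = [(0, len(x))]
    while stack:
        left, right = stack.pop()
        if right - left < 2 * min_length:
            continue  # too short to split further
        m, j = best_split(x, left, right, min_length)
        if j is None or m < threshold:
            continue  # no convincing change point in this interval
        boundaries.append(j)
        stack.extend([(left, j), (j, right)])
    return sorted(boundaries)


signal = [0] * 200 + [1] * 200
print(dyadic_segmentation(signal, min_length=50, threshold=2.0))  # [0, 200, 400]
```

An explicit stack replaces recursion so that long sequences do not hit Python's recursion limit.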

30
Dyadic Segmentation
31
(No Transcript)
32
Confidence Bounds: $r > 1$
  • Given a statistic, e.g. basepair overlap

Find
such that
as small as possible
Average basepair overlap over all potential
genomes for the region considered
33
Use Algorithm 4.1
  • For each segment pick random block of length
    proportional to segment length
  • Concatenate to get block of length L
  • Compute bp overlap for block
  • Repeat many times
  • Use the $100(1-\alpha)$ percentiles of this for the confidence bounds
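
The last step, reading bounds off the repeated block statistics, could be sketched like this. The exact percentile rule is not recoverable from the transcript, so a two-sided nearest-rank interval is assumed here; `nearest_rank` and `percentile_bounds` are hypothetical helper names.

```python
import math


def nearest_rank(ordered, q):
    """Nearest-rank quantile of an already sorted list, for 0 < q <= 1."""
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


def percentile_bounds(draws, alpha):
    """Lower and upper bounds covering roughly 100 * (1 - alpha) percent of draws.

    Assumption: alpha is split evenly between the two tails.
    """
    if not draws or alpha <= 0 or alpha >= 1:
        raise ValueError("need at least one draw and 0 < alpha < 1")
    ordered = sorted(draws)
    return nearest_rank(ordered, alpha / 2), nearest_rank(ordered, 1 - alpha / 2)


# Block overlap values would normally come from repeated segmented draws.
block_overlaps = [8, 3, 5, 1, 7, 2, 6, 4]
print(percentile_bounds(block_overlaps, alpha=0.5))  # (2, 6)
```

With more draws and a smaller alpha, the same function gives the usual 95% style bounds.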

34
Testing Association
  • Question: How do we estimate the null distribution
    given only data for which we believe the null is
    false?

35
Testing Association (bp overlap)
Observed sequence (Feature 1, Feature 2)
Sample two blocks of equal length.
Align Feature 1 of the first block with Feature 2 of
the second block, and vice versa.
Calculate the overlap in the blocks after swapping:
$(X_2)(Y_1) + (X_1)(Y_2)$
The statistic is $(X_2)(Y_1) + (X_1)(Y_2)$, properly
normalized and set to mean 0. Under the null
hypothesis of independence, this should be
Gaussian.
36
Test Statistic
  • H: features not associated in each segment
    (so-called dummy overlap)
  • Then has a Gaussian
    distribution.
  • We form the test statistic

  • where

  • (length of segment $i$) / $n$
  • number of basepairs in segment $i$ identified as Feature 1
  • number of basepairs in segment $i$ identified as Feature 2
37
Null Distribution
  • Choose pairs of blocks at random
  • Compute the false (dummy) overlap $H$
  • Compute $I$ (Feature 1 coverage) and $J$ (Feature 2 coverage)
  • Block-bootstrapped null: $H - IJ$
  • If $r > 1$, pairs of blocks are chosen in each
    region; $H$ and $IJ$ are weighted sums across
    regions.
  • The null has mean zero and the correct
    variance
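
For a single region, the block-swapping null might be sketched as below. Several details are assumptions because the slide formulas were lost: overlap and coverage are expressed as fractions of the paired blocks, the dummy overlap is the sum of the two swapped overlaps, and the two blocks are placed independently (so they may occasionally intersect). The name `dummy_overlap_null` is hypothetical.

```python
import random


def dummy_overlap_null(f1, f2, block_length, n_pairs, seed=0):
    """Null draws H - I * J from pairs of randomly placed blocks.

    f1 and f2 are equal-length 0/1 tracks. For each pair of blocks,
    Feature 1 of one block is laid over Feature 2 of the other (and vice
    versa) to get the dummy overlap fraction H; I and J are the pooled
    coverage fractions of Feature 1 and Feature 2 in the two blocks.
    """
    n = len(f1)
    if n != len(f2) or block_length < 1 or block_length > n:
        raise ValueError("tracks must match and block_length must fit")
    rng = random.Random(seed)
    draws = []
    for _ in range(n_pairs):
        a = rng.randint(0, n - block_length)
        b = rng.randint(0, n - block_length)
        x1, y1 = f1[a:a + block_length], f2[a:a + block_length]
        x2, y2 = f1[b:b + block_length], f2[b:b + block_length]
        swapped = (sum(p * q for p, q in zip(x2, y1))
                   + sum(p * q for p, q in zip(x1, y2)))
        h = swapped / (2 * block_length)
        i = (sum(x1) + sum(x2)) / (2 * block_length)
        j = (sum(y1) + sum(y2)) / (2 * block_length)
        draws.append(h - i * j)
    return draws


null_draws = dummy_overlap_null([1, 0] * 200, [1, 1, 0, 0] * 100, 40, 300)
print(len(null_draws))  # 300
```

The observed overlap statistic for the region would then be compared against the spread of `null_draws`.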

38
Example from ENCODE data
  • ENm001: the ENCODE Consortium annotated over 2500
    feature instances exclusive of UTRs and CDSs
  • Question: Do these (largely) non-coding features
    exhibit more overlap with constrained sequences
    than expected at random?
  • To answer, we used the block bootstrap to obtain
    the null distribution
  • When the null is Gaussian, it has the correct
    variance
  • When not, it is overly conservative
  • Segmentation can reduce conservativeness and
    detect significance that would otherwise be missed

39
(No Transcript)
40
There are two L's
  • $L_s$: the minimum segment length during
    segmentation
  • To be discussed
  • $L$: the length of blocks during subsampling
  • Chosen on grounds of stability

41
A philosophical question: The Issue of Scale
  • Relevant probability assessments depend on
    segmentation
  • Segmentation depends on scale
  • Things which seem surprising on small scales may
    not be at larger ones
  • E.g. differences in GC content

My view: It's only determinable biologically
42
Some Future Directions
  • KS-type tests
  • Beyond overlap, KS-type tests can compare the
    distributions of features, e.g. Does the pattern
    of constrained sequence in coding regions differ
    from that in non-coding regions?
  • Maxima
  • Aggregative plots can summarize one feature in
    the neighborhood of another, e.g. Does binding
    data (such as ChIP-chip) show that a given
    regulatory factor tends to bind near TSSs?
  • Other types of association
  • Does wavelet analysis offer significant support
    for the large-scale association of replication
    timing and conservation?
  • Many others arising from ENCODE, modENCODE, and
    elsewhere
  • Other types of segmentation
  • Dyadic segmentation is analytically convenient,
    but other segmentations may be useful

43
Acknowledgements
  • The ENCODE Consortium
  • The MSA and Transcription and Regulation Groups
  • Especially Elliot Margulies, Tom Gingeras and
    Ewan Birney
  • Supported by NIGMS and NHGRI

44
(No Transcript)
45
Association of functional annotations in Human
Genome
Table from the ENCODE Consortium paper, Nature, June
14, 2007
46
Dyadic Segmentation
Algorithm 4.8
For a minimum region length $L_s$ and threshold $b$,
initialize
  • For $i = 1, \dots, t-1$, let $M^{(i)}(j)$ and $V^{(i)}(j)$ be
    respectively the processes (4.7) and (4.8)
    computed on the subsequence $X_{t_{i-1}+1}, \dots, X_{t_i}$. Let
    $\hat{t}_i = \arg\max_j M^{(i)}(j)$, and $m_i = \min(\hat{t}_i - t_{i-1}, t_i - \hat{t}_i)$. Let
  • Let $V_i = V^{(i)}(\hat{t}_i)$. Let
    If
    stop, return $t$.
  • Let $i^* = \arg\max_i B_i$, and $t_{\text{new}} = \hat{t}_{i^*}$
  • Let $t = t \cup \{t_{\text{new}}\}$, reordered so that $t_i$ is
    monotonically increasing in $i$.

47
(No Transcript)