Data Analysis for High-Throughput Sequencing

1
Data Analysis for High-Throughput Sequencing
  • Mark Reimers
  • Tobias Guennel
  • Department of Biostatistics

2
Unto the Frontiers of Ignorance
  • I love the way this workshop starts off with
    things we understand fairly well and works up to
    the cutting edge of things we don't understand at
    all
  • - Mike Neale, Oct 14, 2010

3
The New Boyfriend/Girlfriend
4
Where Does HTS Really Make the Difference?
  • Sequencing for novel variants
  • ChIP-Seq for DNA-binding proteins or less common
    histone marks
  • Allele-specific expression
  • COMING SOON
  • DNA methylation

5
Outline
  • Biases in reads
  • RNA-Seq
  • normalization
  • basic tests
  • differential splicing
  • Finding peaks in ChIP-Seq

6
Technical Biases: Sequence Start
The initial bases of reads are highly biased, and
the bias depends on RNA/DNA preparation
7
Sequence Biases: K-mers Differ
  • (Schroeder et al, PLoS One, 2010) calculated
    proportions of words (k-mers) starting at various
    positions

Expected frequencies if bases random
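
The comparison on this slide can be sketched in a few lines. This is a minimal illustration, not the procedure of Schroeder et al: the function names and the toy reads are hypothetical, and "random" is assumed here to mean four equally likely, independent bases.

```python
from collections import Counter

def kmer_proportions(reads, k, pos):
    """Proportion of each k-mer starting at 0-based position `pos` across reads."""
    counts = Counter(r[pos:pos + k] for r in reads if len(r) >= pos + k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def expected_uniform(k):
    """Expected proportion of any one k-mer if bases were uniform and independent
    (an assumption; a genome-composition baseline would differ)."""
    return 0.25 ** k

# Toy reads for illustration only, not real data
reads = ["ACGTAC", "ACGGTA", "TCGTAC", "ACTTAC"]
at_start = kmer_proportions(reads, k=2, pos=0)   # k-mers at the first base
mid_read = kmer_proportions(reads, k=2, pos=2)   # k-mers further into the read
baseline = expected_uniform(2)
```

Comparing `at_start` and `mid_read` against `baseline` shows whether the read start is more skewed than the interior of the read.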
8
Position of single mismatch in uniquely mapped
tags
Courtesy Jean and Danielle Thierry-Mieg
9
Types of mismatches in uniquely mapped tags with
a single mismatch are profoundly asymmetric and
biased
Courtesy Jean and Danielle Thierry-Mieg
10
Technical Biases: Initiation Sites
COX1
11
Different Platforms Have Different Biases
  • (Harismendy et al, Genome Biology, 2009)
    sequenced a section of 4 HapMap individuals on
    Roche 454, on Illumina, and on SOLiD
  • 454 had most even coverage

12
Initiation Biases Dwarf Splicing
  • Counts of reads along the gene APOE in different
    tissues, using data from the Wold lab: (a) brain,
    (b) liver, (c) skeletal muscle

13
Variation in Technical Biases
  • Sometimes the initial base biases change
    substantially; most base proportions change
    together (one PC explains 95%)
  • In most preparations the initiation site biases
    change by a few percent
  • In a few preparations the initiation site biases
    change by 20-30%
  • This may have consequences for representation in
    ChIP-Seq assays

14
RNA-Seq Data Analysis
15
Biases in Proportions
  • Fragments compete for real-estate on the lane
  • If a few dozen genes are highly expressed in one
    tissue, they will competitively inhibit the
    sequencing of other genes, resulting in what
    appears to be lower expression
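
A toy calculation makes the competition effect concrete. The gene names and abundances below are hypothetical; the only point is that read counts behave like proportions of a fixed total.

```python
def proportions(abundance):
    """Share of the lane that each gene would receive, given its true abundance."""
    total = sum(abundance.values())
    return {gene: a / total for gene, a in abundance.items()}

# Hypothetical true abundances: geneA and geneB are identical in both tissues,
# but geneX is very highly expressed in tissue 2 only.
tissue1 = {"geneA": 100, "geneB": 100, "geneX": 0}
tissue2 = {"geneA": 100, "geneB": 100, "geneX": 800}

p1 = proportions(tissue1)
p2 = proportions(tissue2)
```

Here geneA takes half of the reads in tissue 1 but only a tenth in tissue 2, although its true abundance has not changed.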

16
Effects of Competition
  • (Robinson and Oshlack, Genome Biology, 2010)

17
A Simple Normalization
  • Align the medians of the housekeeping genes, or
    the genes that are not expressed at very high
    levels in any sample, across the samples
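
The median alignment can be sketched as follows. This is one possible reading of the slide, with hypothetical function names and counts; it assumes the reference (housekeeping) genes are already chosen and that their median count is non-zero in every sample.

```python
from statistics import median

def median_scale_factors(counts, reference_genes):
    """counts: {sample: {gene: count}}.
    Returns a per-sample factor that moves each sample's reference-gene median
    onto a common target (here, the median of the per-sample medians)."""
    medians = {s: median(c[g] for g in reference_genes) for s, c in counts.items()}
    target = median(medians.values())
    return {s: target / m for s, m in medians.items()}

def normalize(counts, factors):
    """Multiply every count in a sample by that sample's factor."""
    return {s: {g: v * factors[s] for g, v in c.items()} for s, c in counts.items()}

# Hypothetical counts; hk1..hk3 play the role of housekeeping genes
counts = {
    "s1": {"hk1": 10, "hk2": 20, "hk3": 30, "x": 5},
    "s2": {"hk1": 20, "hk2": 40, "hk3": 60, "x": 500},
}
factors = median_scale_factors(counts, ["hk1", "hk2", "hk3"])
normalized = normalize(counts, factors)
```

After scaling, the housekeeping medians agree across samples, while the highly expressed gene `x` is left out of the calculation of the factors.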

18
A Simple Model for Counts
  • Poisson distribution of counts within a gene with
    mean proportional to Np
  • SD of variation equal to square root of Np
  • Problem: actual variation of counts between
    replicate samples is significantly higher than
    root Np
  • Probably reflecting systematic biases
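
A quick way to check the Poisson assumption for one gene is to compare the variance of its replicate counts with their mean, since the two are equal under a Poisson model. The sketch below uses hypothetical replicate counts and a hypothetical function name.

```python
from statistics import mean, variance

def variance_to_mean(replicate_counts):
    """Sample variance divided by the mean.
    A Poisson model predicts a value near 1; much larger values indicate
    over-dispersion."""
    return variance(replicate_counts) / mean(replicate_counts)

# Hypothetical counts for one gene in four replicate samples
replicates = [90, 110, 130, 70]
ratio = variance_to_mean(replicates)
```

For these counts the mean is 100, so a Poisson model predicts an SD of about 10, but the observed variance is several times the mean.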

19
Hacks for Over-Dispersion
  • Like the λ fudge-factor in GWAS
  • Use negative binomial model
  • There is no relation to the usual meaning of the
    distribution (number of nulls until something
    happens)
  • Convenient way to parametrise over-dispersion
  • Bioconductor package edgeR estimates parameters
    by Maximum Likelihood
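
In a common parametrisation of the negative binomial, the variance is mu + phi * mu^2, where phi is the dispersion and phi = 0 recovers the Poisson case. The sketch below estimates phi by a simple method of moments. This is an illustrative stand-in and not the maximum likelihood estimator used by edgeR; the function names and counts are hypothetical.

```python
from statistics import mean, variance

def nb_variance(mu, phi):
    """Negative binomial variance for mean `mu` and dispersion `phi`."""
    return mu + phi * mu ** 2

def moment_dispersion(replicate_counts):
    """Method-of-moments dispersion: solve s^2 = m + phi * m^2 for phi,
    truncated at zero (no over-dispersion)."""
    m = mean(replicate_counts)
    s2 = variance(replicate_counts)
    return max(0.0, (s2 - m) / m ** 2)

# Hypothetical counts for one gene in four replicate samples
counts = [80, 100, 150, 70]
phi = moment_dispersion(counts)
```

With only a handful of replicates a per-gene moment estimate like this is very noisy, which is one reason to prefer a likelihood-based estimate.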

20
Alternate Transcripts: Splicing Index
  • For each exon, the proportion of transcripts in
    which the exon appears
  • Hard to estimate because different exons have
    different representation probabilities
  • Use ratios of exons
  • Use constitutive exons (if known) as baseline;
    for them SI = 1

from Wang et al, Nature, 2008
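
A ratio-based splicing index with a constitutive baseline might be computed as below. This is a simplified sketch and not the method of Wang et al: it uses read density per base and ignores the exon-specific representation probabilities that make the estimate hard in practice. Exon names, counts and lengths are hypothetical.

```python
def splicing_index(exon_reads, exon_lengths, constitutive):
    """Read density of each exon divided by the mean density of the
    constitutive exons, so that constitutive exons average SI = 1."""
    density = {e: exon_reads[e] / exon_lengths[e] for e in exon_reads}
    baseline = sum(density[e] for e in constitutive) / len(constitutive)
    return {e: d / baseline for e, d in density.items()}

# Hypothetical gene with three exons; e1 and e3 are assumed constitutive
exon_reads = {"e1": 200, "e2": 50, "e3": 300}
exon_lengths = {"e1": 100, "e2": 100, "e3": 150}
si = splicing_index(exon_reads, exon_lengths, constitutive=["e1", "e3"])
```

In this toy example exon e2 has an index of 0.25, which would suggest it appears in about a quarter of the transcripts if the densities were unbiased.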
21
Detecting Alternate Splicing I
  • (Wang et al, Nature, 2008) measured splicing
    index for several tissues

22
Splicing Junction Reads
  • Some reads will span two different exons
  • Need long enough reads to be able to reliably map
    both sides
  • Can use information from one exon to identify the
    gene and restrict possibilities for the 5' end of
    the other exon

from Wang et al NAR 2010
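
The idea of restricting the search to exons of one gene can be sketched as follows. This is a toy exact-match illustration, not the algorithm of Wang et al: the gene is assumed to be already identified, its exons are given in transcript order, and `min_overhang` (a hypothetical parameter) is the minimum number of read bases required on each side of the junction.

```python
def find_junctions(read, exons, min_overhang):
    """Return (donor_index, acceptor_index, split) for every way the read can
    be split so that its left part ends a donor exon and its right part
    starts a later exon of the same gene."""
    hits = []
    for split in range(min_overhang, len(read) - min_overhang + 1):
        left, right = read[:split], read[split:]
        for i, donor in enumerate(exons):
            if not donor.endswith(left):
                continue
            # Only downstream exons of the same gene are candidates
            for j in range(i + 1, len(exons)):
                if exons[j].startswith(right):
                    hits.append((i, j, split))
    return hits

# Toy exon sequences for one hypothetical gene
exons = ["AAAACCCC", "GGGGTTTT", "ACACACAC"]
hits = find_junctions("CCGGGG", exons, min_overhang=2)
```

A larger `min_overhang` needs longer reads but makes a chance match on the short side less likely, which is the trade-off the slide points to.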
23
ChIP-Seq
24
Courtesy Raphael Gottardo
25
A View of ChIP-Seq Data
  • Typically reads are quite sparsely distributed
    over the genome
  • Controls (i.e. no pull-down by antibody) often
    show smaller peaks at the same locations
  • Probably due to open chromatin at promoters

Rozowsky et al Nature Methods, 2009
26
Always Have a Control
  • High correlation between peaks in control samples
    and peaks in ChIP sample
  • Must subtract an estimate of background, derived
    from the control tags

From Zhang et al, Genome Biol 2008
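
One simple form of background subtraction is shown below. It is an assumption-laden sketch and not the model of Zhang et al: tags are counted in fixed bins, the control is rescaled by the ratio of total tag counts, and negative values are clipped to zero. All names and numbers are hypothetical.

```python
def background_subtracted(chip_bins, control_bins, chip_total, control_total):
    """Subtract the library-size-scaled control count from each ChIP bin."""
    scale = chip_total / control_total  # puts control on the ChIP sample's scale
    return [max(0.0, c - scale * b) for c, b in zip(chip_bins, control_bins)]

# Hypothetical tag counts in three adjacent bins
chip_bins = [2, 30, 4]
control_bins = [1, 5, 2]
signal = background_subtracted(chip_bins, control_bins,
                               chip_total=1000, control_total=500)
```

The middle bin keeps a signal of 20 after subtraction, while the flanking bins are fully explained by the control.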
27
Locating Binding Sites
  • Use the fact that reads on opposite sides of the
    site are sequenced in opposite senses

From Zhao et al NAR 2009
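
One way to exploit the strand asymmetry is to find the offset that best aligns the forward-strand pile-up with the reverse-strand pile-up. The sketch below is a generic cross-correlation, not the method of Zhao et al; it assumes per-position counts of read 5' ends on each strand, and the profiles are made up.

```python
def best_shift(fwd, rev, max_shift):
    """Offset d (0..max_shift) that maximises sum over x of fwd[x] * rev[x + d]."""
    def score(d):
        return sum(f * rev[x + d] for x, f in enumerate(fwd) if x + d < len(rev))
    return max(range(max_shift + 1), key=score)

# Hypothetical 5'-end counts: forward reads pile up before the site,
# reverse reads pile up after it
fwd = [0, 0, 5, 9, 5, 0, 0, 0, 0, 0, 0, 0]
rev = [0, 0, 0, 0, 0, 0, 0, 5, 9, 5, 0, 0]

shift = best_shift(fwd, rev, max_shift=10)
fwd_peak = fwd.index(max(fwd))
site_estimate = fwd_peak + shift / 2  # midpoint between the two strand peaks
```

The binding site is then placed halfway between the two strand-specific peaks.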