Title: Data Analysis for High-Throughput Sequencing
1Data Analysis for High-Throughput Sequencing
- Mark Reimers
- Tobias Guennel
- Department of Biostatistics
2Unto the Frontiers of Ignorance
- "I love the way this workshop starts off with
things we understand fairly well and works up to
the cutting edge of things we don't understand at
all." (Mike Neale, Oct 14, 2010)
3The New Boyfriend/Girlfriend
4Where Does HTS Really Make the Difference?
- Sequencing for novel variants
- ChIP-Seq for DNA-binding proteins or less common
histone marks
- Allele-specific expression
- COMING SOON
- DNA methylation
5Outline
- Biases in reads
- RNA-Seq
- normalization
- basic tests
- differential splicing
- Finding peaks in ChIP-Seq
6Technical Biases: Sequence Start
The initial bases of reads are highly biased, and
the bias depends on RNA/DNA preparation
7Sequence Biases: K-mers Differ
- (Schroeder et al, PLoS One, 2010) calculated
proportions of words (k-mers) starting at various
positions
Expected frequencies if bases random
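The positional k-mer comparison described above can be sketched as follows. This is a minimal illustration, not the procedure of Schroeder et al: the function names and the toy reads are hypothetical, and "random" is assumed here to mean four equally likely bases.

```python
from collections import Counter

def kmer_proportions_by_position(reads, k=2, max_pos=5):
    """Return {position: {kmer: proportion}} for k-mers starting at each read offset."""
    result = {}
    for pos in range(max_pos):
        counts = Counter(
            read[pos:pos + k]
            for read in reads
            if len(read) >= pos + k  # skip reads too short to hold a k-mer here
        )
        total = sum(counts.values())
        result[pos] = {kmer: n / total for kmer, n in counts.items()} if total else {}
    return result

def expected_proportion(k):
    """Expected proportion of any one k-mer, assuming four equally likely bases."""
    return 0.25 ** k

# Hypothetical toy reads; real input would come from a FASTQ file.
reads = ["ACGTAC", "ACGGTA", "ACTTGA", "GGCATC"]
props = kmer_proportions_by_position(reads, k=2, max_pos=3)

for pos, table in props.items():
    for kmer, p in sorted(table.items()):
        ratio = p / expected_proportion(2)
        print(f"pos {pos} {kmer}: observed {p:.2f}, observed/expected {ratio:.1f}")
```

Large observed/expected ratios at the first few positions, fading further into the read, would be the signature of the start-of-read bias described on the previous slide.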
8Position of single mismatch in uniquely mapped
tags
Courtesy Jean and Danielle Thierry-Mieg
9Types of mismatches in uniquely mapped tags with
a single mismatch are profoundly asymmetric and
biased
Courtesy Jean and Danielle Thierry-Mieg
10Technical Biases: Initiation Sites
COX1
11Different Platforms Have Different Biases
- (Harismendy et al, Genome Biology, 2009)
sequenced a section of 4 HapMap individuals on
Roche 454, on Illumina, and on SOLiD
- 454 had the most even coverage
12Initiation Biases Dwarf Splicing
- Counts of reads along the gene APOE in different
tissues, from Wold lab data. (a) Brain, (b)
liver, (c) skeletal muscle
13Variation in Technical Biases
- Sometimes the initial base biases change
substantially
- Most base proportions change together: one PC
explains 95%
- In most preparations the initiation site biases
change by a few percent
- In a few preparations the initiation site biases
change by 20-30%
- This may have consequences for representation in
ChIP-Seq assays
14RNA-Seq Data Analysis
15Biases in Proportions
- Fragments compete for real-estate on the lane
- If a few dozen genes are highly expressed in one
tissue, they will competitively inhibit the
sequencing of other genes, resulting in what
appears to be lower expression
16Effects of Competition
- (Robinson & Oshlack, Genome Biology, 2010)
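A toy calculation may make the competition effect concrete. The gene names and abundances below are invented for illustration; the only assumption is that each gene's share of reads equals its share of the total material loaded on the lane.

```python
def read_proportions(abundances):
    """Convert true transcript abundances into expected shares of the sequenced reads."""
    total = sum(abundances.values())
    return {gene: amount / total for gene, amount in abundances.items()}

# Hypothetical abundances: geneX is equally abundant in both tissues,
# but tissue B also contains one very highly expressed gene.
tissue_a = {"geneX": 100, "geneY": 100, "geneZ": 100}
tissue_b = {"geneX": 100, "geneY": 100, "geneZ": 100, "geneHigh": 700}

share_a = read_proportions(tissue_a)["geneX"]
share_b = read_proportions(tissue_b)["geneX"]
apparent_fold_change = share_b / share_a

print(f"geneX share of reads: A={share_a:.3f}, B={share_b:.3f}")
print(f"apparent fold change B/A: {apparent_fold_change:.2f}")
```

Although geneX has the same true abundance in both tissues, its share of reads drops in tissue B, so a naive comparison of proportions would report lower expression.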
17A Simple Normalization
- Align the medians of the housekeeping genes, or
the genes that are not expressed at very high
levels in any sample, across the samples
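The median alignment described above could look roughly like this. It is a sketch under assumptions: the data layout (a dict of per-sample gene counts), the function names, and the choice of the median of the per-sample medians as the common target are all illustrative rather than taken from the slides.

```python
from statistics import median

def median_scale_factors(counts, reference_genes):
    """counts: {sample: {gene: count}}.

    Returns {sample: factor} such that, after scaling, the median count of
    the reference genes is the same in every sample.
    """
    medians = {
        sample: median(gene_counts[g] for g in reference_genes)
        for sample, gene_counts in counts.items()
    }
    target = median(medians.values())  # assumed common target level
    return {sample: target / m for sample, m in medians.items()}

def normalize(counts, factors):
    """Apply one scale factor per sample to every gene in that sample."""
    return {
        sample: {g: c * factors[sample] for g, c in gene_counts.items()}
        for sample, gene_counts in counts.items()
    }

# Hypothetical counts: "hi" is very highly expressed in s1 only,
# so it is left out of the reference set.
counts = {
    "s1": {"a": 10, "b": 20, "c": 30, "hi": 1000},
    "s2": {"a": 20, "b": 40, "c": 60, "hi": 100},
}
factors = median_scale_factors(counts, reference_genes=["a", "b", "c"])
normalized = normalize(counts, factors)
print(factors)
print(normalized["s1"]["b"], normalized["s2"]["b"])
```

Leaving the highly expressed gene out of the reference set keeps it from dragging the scale factors, which is the point of restricting to housekeeping or moderately expressed genes. A reference median of zero would need separate handling.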
18A Simple Model for Counts
- Poisson distribution of counts within a gene, with
mean proportional to Np
- SD of variation equal to the square root of Np
- Problem: actual variation of counts between
replicate samples is significantly higher than
root Np
- Probably reflecting systematic biases
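One simple way to check the Poisson assumption for a gene is to compare the variance of its replicate counts with their mean, since the two are equal under a Poisson model. The sketch below uses made-up replicate counts and a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def dispersion_index(replicate_counts):
    """Sample variance divided by mean; close to 1 if counts are Poisson."""
    return variance(replicate_counts) / mean(replicate_counts)

# Hypothetical counts for one gene across five replicate samples.
replicates = [90, 110, 100, 130, 70]

m = mean(replicates)
poisson_sd = sqrt(m)                    # SD predicted by the Poisson model
observed_sd = sqrt(variance(replicates))
index = dispersion_index(replicates)

print(f"mean={m}, Poisson SD={poisson_sd:.1f}, observed SD={observed_sd:.1f}")
print(f"variance/mean = {index:.1f}")
```

In this toy example the observed SD is more than twice the Poisson prediction, the kind of excess the slide attributes to systematic biases.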
19Hacks for Over-Dispersion
- Like the λ fudge-factor in GWAS
- Use negative binomial model
- There is no relation to the meaning of the distribution
(number of nulls until something happens)
- Convenient way to parametrise over-dispersion
- Bioconductor package edgeR estimates parameters
by Maximum Likelihood
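To show what "parametrising over-dispersion" means, the sketch below uses the common negative binomial form in which the variance is mu + phi * mu^2, so phi = 0 recovers the Poisson case. The estimator shown is a crude method-of-moments calculation chosen for brevity. It is not edgeR's likelihood-based procedure, and the function names and counts are hypothetical.

```python
from statistics import mean, variance

def nb_variance(mu, phi):
    """Negative binomial variance for mean mu and dispersion phi."""
    return mu + phi * mu ** 2

def moment_dispersion(counts):
    """Rough method-of-moments dispersion: solve variance = mu + phi * mu^2 for phi.

    Clipped at zero, because under-dispersed data would give a negative value.
    """
    m = mean(counts)
    v = variance(counts)
    return max(0.0, (v - m) / m ** 2)

# Hypothetical replicate counts for one gene.
replicates = [90, 110, 100, 130, 70]
phi = moment_dispersion(replicates)

print(f"estimated dispersion phi = {phi:.3f}")
print(f"implied variance at mu=100: {nb_variance(100, phi):.1f}")
```

With only a handful of replicates a per-gene moment estimate like this is very noisy, which is one reason dedicated packages use likelihood methods instead.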
20Alternate Transcripts: Splicing Index
- For each exon, the proportion of transcripts in
which the exon appears
- Hard to estimate because different exons have
different representation probabilities
- Use ratios of exons
- Use constitutive exons (if known) as baseline:
for them SI = 1
from Wang et al, Nature, 2008
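A ratio-based splicing index along these lines might be computed as below. This is a sketch under assumptions, not the method of Wang et al: read density (reads per base) is used to make exons of different lengths comparable, the baseline is the mean density of the constitutive exons, and all names and numbers are hypothetical.

```python
def splicing_index(exon_reads, exon_lengths, constitutive):
    """Return {exon: SI}, where SI is exon read density relative to the
    mean read density of the constitutive exons (whose SI averages to 1)."""
    density = {e: exon_reads[e] / exon_lengths[e] for e in exon_reads}
    baseline = sum(density[e] for e in constitutive) / len(constitutive)
    return {e: d / baseline for e, d in density.items()}

# Hypothetical gene with three exons; e1 and e3 are assumed constitutive.
exon_reads = {"e1": 200, "e2": 50, "e3": 300}
exon_lengths = {"e1": 100, "e2": 100, "e3": 150}

si = splicing_index(exon_reads, exon_lengths, constitutive=["e1", "e3"])
for exon, value in si.items():
    print(f"{exon}: SI = {value:.2f}")
```

Here e2 would be read as appearing in roughly a quarter of transcripts. This simple ratio does not correct for the exon-specific representation probabilities mentioned above, so real estimates would be less clean.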
21Detecting Alternate Splicing I
- (Wang et al, Nature, 2008) measured splicing
index for several tissues
22Splicing Junction Reads
- Some reads will span two different exons
- Need long enough reads to be able to reliably map
both sides
- Can use information from one exon to identify the
gene and restrict possibilities for the 5' end of
the other exon
from Wang et al NAR 2010
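One common way to handle junction-spanning reads, offered here as an assumption rather than the approach of the cited paper, is to build a small library of exon-exon junction sequences and align reads to it. The sketch shows why read length matters: each flank is trimmed so that any full-length read matching the junction sequence must cover at least `min_overhang` bases on both sides. The function name and sequences are hypothetical.

```python
def junction_sequence(upstream_exon, downstream_exon, read_length, min_overhang):
    """Join the end of one exon to the start of the next.

    Each flank is read_length - min_overhang bases long, so a read of
    read_length bases placed anywhere inside the result overlaps both
    exons by at least min_overhang bases.
    """
    if not 0 < 2 * min_overhang <= read_length:
        raise ValueError("need 0 < 2 * min_overhang <= read_length")
    flank = read_length - min_overhang
    return upstream_exon[-flank:] + downstream_exon[:flank]

# Hypothetical exon sequences and a very short read length, for readability.
junction = junction_sequence("AAAACCCC", "GGGGTTTT", read_length=6, min_overhang=2)
read = "CCGGGG"  # two bases from the upstream exon, four from the downstream one

print(junction)
print("read maps to junction:", read in junction)
```

With short reads the required overhang leaves few possible placements, which is why longer reads make junction mapping more reliable.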
23ChIP-Seq
24Courtesy Raphael Gottardo
25A View of ChIP-Seq Data
- Typically reads are quite sparsely distributed
over the genome
- Controls (i.e. no pull-down by antibody) often
show smaller peaks at the same locations
- Probably due to open chromatin at the promoter
Rozowsky et al Nature Methods, 2009
26Always Have a Control
- High correlation between peaks in control samples
and peaks in ChIP sample
- Must subtract an estimate of background, derived
from the control tags
From Zhang et al, Genome Biol 2008
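A very simple version of background subtraction is sketched below. It is not the method of Zhang et al: the genome is assumed to be pre-divided into fixed bins, the control is rescaled by total read depth (a crude choice), and the function name and counts are hypothetical.

```python
def background_subtracted(chip_bins, control_bins):
    """Subtract depth-scaled control counts from ChIP counts, bin by bin.

    Negative values are clipped to zero, since a bin cannot hold fewer
    than zero enriched tags.
    """
    scale = sum(chip_bins) / sum(control_bins)  # assumed depth normalisation
    return [max(0.0, chip - scale * ctrl) for chip, ctrl in zip(chip_bins, control_bins)]

# Hypothetical tag counts in four adjacent bins. The control shows a smaller
# peak in the same bin as the ChIP sample.
chip_bins = [5, 5, 60, 10]
control_bins = [5, 5, 20, 10]

signal = background_subtracted(chip_bins, control_bins)
print(signal)
```

Scaling by total depth tends to over-subtract when much of the ChIP signal is real, so more careful schemes estimate the scale factor from background regions only.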
27Locating Binding Sites
- Use the fact that reads on opposite sides of the
site are sequenced in opposite senses
From Zhao et al NAR 2009
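The strand asymmetry can be turned into a position estimate in a few lines. The sketch below is a deliberately simple stand-in for published peak finders: within a candidate region it takes the midpoint between the average 5' end of forward-strand reads and the average 5' end of reverse-strand reads. The function name and coordinates are hypothetical.

```python
from statistics import mean

def estimate_binding_site(forward_starts, reverse_starts):
    """Estimate a binding site from strand-specific read positions.

    forward_starts: 5' ends of forward-strand reads (pile up upstream of the site)
    reverse_starts: 5' ends of reverse-strand reads (pile up downstream of the site)
    Returns (estimated site, distance between the two strand clusters).
    """
    fwd_centre = mean(forward_starts)
    rev_centre = mean(reverse_starts)
    site = (fwd_centre + rev_centre) / 2
    offset = rev_centre - fwd_centre
    return site, offset

# Hypothetical read positions around one candidate peak.
forward_starts = [900, 920, 940]
reverse_starts = [1060, 1080, 1100]

site, offset = estimate_binding_site(forward_starts, reverse_starts)
print(f"estimated site: {site}, strand offset: {offset}")
```

The strand offset should be close to the typical fragment length, so an implausible offset is a useful warning that the region is not a clean single binding site.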