Title: Functional Genomics with Next-Generation Sequencing
1 Functional Genomics with Next-Generation Sequencing
- Jen Taylor
- Bioinformatics Team
- CSIRO Plant Industry
2 Capacity and Resolution
- Next generation sequencing
- Increasing capacity leads to increased resolution
Eric Lander, Broad Institute
3 How Does a Genome Work?
- Parts Description
- Function?
- Interconnectedness?
- Comparisons
- Population-level
- Between genomes
4 Application domains
- Reference genome
- No Reference Genome
- Partially sequenced
- UNsequenced
- PUN Genomes
5 Impact of a Reference Genome
Diagram: Sequence Data, Alignment, Genome, Read Density, Characterisation
6 Applications of Next Generation Sequencing
- Profiling of Variation
- Genetic variation
- Transcript variation
- Epigenetic variation
- Metagenomic variation
- Discovery
- Novel genomes
- Novel genes
- Novel transcripts
- Small / long non-coding RNA
7 RNASeq
- Qualitative transcript diversity
- Quantitative transcript abundance
- Impact of NGS
- Observation of transcript complexity
- Transcript discovery
- Small / long non-coding RNA
- Analytical challenges
- Transcript complexity
- Compositional properties
8 RNASeq
- Sample: Total RNA, PolyA RNA, Small RNA
- Library Construction
- Sequencing
- Base calling, QC
- Reference: Mapping to Genome
- PUN: Assembly to Contigs
- Analysis
- Digital Counts: Reads per kilobase per million (RPKM)
- Transcript structure
- Secondary structure
- Targets or Products
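The RPKM measure named above can be sketched in a few lines. This is a minimal illustration that assumes the usual definition (read count divided by transcript length in kilobases and by total mapped reads in millions); the function name and the example numbers are hypothetical and not taken from the slides.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical numbers, for illustration only:
# 500 reads on a 2 kb transcript in a library of 10 million mapped reads.
example = rpkm(read_count=500, transcript_length_bp=2_000,
               total_mapped_reads=10_000_000)
print(example)
```

Dividing by length and depth makes counts comparable between transcripts and between libraries, but it does not remove the compositional constraint discussed later in the deck.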
9 RNASeq Transcript Complexity
- Mapping
- Reads with multiple locations
- Conserved domains?
- Sequencing error?
- Reads Spanning Exons
- Gapped alignments?
- Sequencing error?
ERANGE pipeline: Mortazavi et al., Nature Methods, Vol. 5, No. 7, July 2008
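One simple way to handle reads with multiple locations is to separate them from uniquely mapped reads and share each one equally across its candidate genes. The sketch below assumes a hypothetical input (a dict from read id to the list of genes it aligns to) and an equal-split rule chosen for illustration; it is not a description of how ERANGE or any other pipeline treats multireads.

```python
from collections import defaultdict


def split_unique_multi(alignments):
    """Separate read ids into uniquely mapped and multiply mapped sets."""
    unique = {r for r, genes in alignments.items() if len(genes) == 1}
    multi = {r for r, genes in alignments.items() if len(genes) > 1}
    return unique, multi


def fractional_counts(alignments):
    """Give each gene 1/n of a read that aligns to n genes (assumed rule)."""
    counts = defaultdict(float)
    for genes in alignments.values():
        if not genes:
            continue  # unmapped read contributes nothing
        weight = 1.0 / len(genes)
        for gene in genes:
            counts[gene] += weight
    return dict(counts)


# Hypothetical alignments: r2 hits two paralogous genes, r3 is unmapped.
alignments = {"r1": ["geneA"], "r2": ["geneA", "geneB"], "r3": []}
unique, multi = split_unique_multi(alignments)
counts = fractional_counts(alignments)
print(unique, multi, counts)
```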
10 RNASeq Compositional properties
- Depth of Sequence
- Sequence count reflects transcript abundance
- Majority of the data can be dominated by a small number of highly abundant transcripts
- Ability to observe transcripts of lower abundance depends on sequencing depth
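The dependence of detection on depth can be made concrete with a small calculation. The sketch assumes an idealised model in which each read is drawn independently and comes from a given transcript with probability equal to that transcript's fraction of the read pool; real libraries deviate from this, and the function names are illustrative.

```python
def detection_probability(fraction, depth):
    """Chance of seeing at least one read from a transcript.

    fraction: share of all reads expected from the transcript (assumed known)
    depth: total number of reads sequenced
    """
    return 1.0 - (1.0 - fraction) ** depth


def expected_reads(fraction, depth):
    """Expected number of reads from the transcript at this depth."""
    return fraction * depth


# A transcript contributing one read per million, at increasing depth.
for depth in (100_000, 1_000_000, 10_000_000):
    print(depth, round(detection_probability(1e-6, depth), 3))
```

Under this model a transcript at one part per million is more likely missed than seen at 100,000 reads and is almost certain to be observed at 10 million.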
11 RNASeq Compositional properties
- Composition
- Sequence counts are a composition of a fixed number of total sequence reads
- Therefore they are sum-constrained and not independent
- Large variations in component numbers and sizes can produce artefacts
Figure panels: True Reads, RPKM
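A toy calculation shows how the sum constraint can create an artefact. The gene names and abundances below are invented for illustration, and the model assumes that expected read counts are simply proportional to true abundance at a fixed total depth.

```python
def proportions(counts):
    """Convert absolute abundances to fractions that sum to one."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


def expected_read_counts(true_abundance, depth):
    """Expected reads per gene when the total number of reads is fixed."""
    return {name: depth * p for name, p in proportions(true_abundance).items()}


# Hypothetical true abundances: only gene3 changes between conditions.
condition_a = {"gene1": 100, "gene2": 100, "gene3": 100}
condition_b = {"gene1": 100, "gene2": 100, "gene3": 700}

reads_a = expected_read_counts(condition_a, depth=900)
reads_b = expected_read_counts(condition_b, depth=900)
print(reads_a)
print(reads_b)
```

At the same depth, gene1 and gene2 receive about a third as many reads in the second condition even though their true abundance is unchanged, because gene3 takes a larger share of the fixed total.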
12 RNASeq - Correspondence
- Good correspondence with
- Expression Arrays
- Tiling Arrays
- qRT-PCR
- Range of up to 5 orders of magnitude
- Better detection of low abundance transcripts
- Greater power to detect
- Transcript sequence polymorphism
- Novel trans-splicing
- Paralogous genes
- Individual cell type expression
13 Reference Genome - RNASeq
14 Reference Genome - RNASeq
Human Exome: number of exons targeted 180,000 (CCDS database), plus 700 miRNA (Sanger v13) and 300 ncRNA
15 Epigenome
- Protein-DNA interactions: ChIPSeq
- Nucleosome positioning
- Histone modification
- Transcription factor interactions
- Methylation: MethylSeq
- Impact of NextGen
- Whole genome profiling
- Resolution
- Analytical challenges
- Systematic bias
- Unambiguous mapping
- Robust event calling
Image: ClearScience
16 ChIPSeq
MNase Linker Digest
17 ChIPSeq
MNase Digest
Remove Nucleosomes
18 ChIPSeq methods
- CisGenome
- ERANGE
- FindPeaks
- F-Seq
- GLITR
- MACS
- PeakSeq
- QuEST
Pepke et al., 2009
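To give a feel for what event calling involves, here is a toy window-based peak caller. It is a swapped-in illustration and not the algorithm of any tool listed above: it assumes reads fall uniformly at random under the null, bins read start positions (0-based) into fixed windows, and flags windows whose count is improbably high under a Poisson model. The window size, threshold and function names are arbitrary choices.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def call_peaks(read_starts, genome_length, window=100, p_threshold=1e-3):
    """Return (start, end, count) for windows enriched over a uniform background."""
    n_windows = math.ceil(genome_length / window)
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    background_rate = len(read_starts) / n_windows  # mean reads per window
    peaks = []
    for i, count in enumerate(counts):
        if poisson_upper_tail(count, background_rate) < p_threshold:
            peaks.append((i * window, min((i + 1) * window, genome_length), count))
    return peaks


# Hypothetical data: one background read per window plus a pile-up near 420-439.
background = list(range(50, 1000, 100))
pile_up = list(range(420, 440))
peaks = call_peaks(background + pile_up, genome_length=1000)
print(peaks)
```

Published tools add much that this sketch omits, such as control samples, strand information and corrections for mappability, which is part of why robust event calling is listed as an analytical challenge.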
19 MethylSeq using Bisulfite conversion
20 Limited publications from BS-Seq
- Mammals
- Methylation predominantly occurs at CpG sites
- Several publications in human
- One publication in mouse
- Plants
- Methylation occurs at CG, CHH, CHG sites
- Two publications in Arabidopsis
H = A, C, or T
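The three sequence contexts can be assigned by looking at the two bases that follow a cytosine. The sketch below is a minimal illustration for the forward strand only, using the IUPAC meaning of H (A, C or T); the function name is hypothetical.

```python
def cytosine_context(sequence, index):
    """Classify the cytosine at `index` as 'CG', 'CHG' or 'CHH'.

    Returns None when there are too few downstream bases, or when a
    downstream base is not A, C, G or T.
    """
    seq = sequence.upper()
    if seq[index] != "C":
        raise ValueError("position is not a cytosine")
    downstream = seq[index + 1:index + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) == 2 and downstream[0] in "ACT":
        if downstream[1] == "G":
            return "CHG"
        if downstream[1] in "ACT":
            return "CHH"
    return None


sequence = "ACGTTCTCCAGTC"
contexts = {i: cytosine_context(sequence, i)
            for i, base in enumerate(sequence) if base == "C"}
print(contexts)
```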
21 Problems of mapping BS-seq reads
- Reduced sequence complexity
>> A C G T T T T T T A G T T >>
22 Problems of mapping BS-seq reads
Watson >> A Cm G T T C T C C A G T C >>
Crick << T G Cm A A G A G G T C A G <<
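The loss of complexity can be reproduced with an in-silico bisulfite conversion of the Watson strand shown above. This is a minimal sketch that assumes complete conversion (every unmethylated C is read as T, every methylated C stays C); the function name and the 0-based position argument are illustrative.

```python
def bisulfite_convert(sequence, methylated_positions=()):
    """Convert unmethylated C to T; methylated C (by 0-based index) is kept."""
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


watson = "ACGTTCTCCAGTC"  # Watson strand from the slide; the C at index 1 is methylated
converted = bisulfite_convert(watson, methylated_positions={1})
print(converted)  # ACGTTTTTTAGTT
```

After conversion the read is almost entirely A, G and T, so it matches the original reference poorly and matches many more places in a reduced-alphabet genome.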
23 ELAND
- Mapping reads to genome sequences
- Mapping reads to two converted genome sequences
- Cross match for reads mapping to multiple positions in converted genomes
- Mapping results were combined to generate methylation information
- ELAND only allows 2 mismatches.
Lister et al. Cell (2008)
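The idea of mapping against two converted genomes can be sketched as follows. This is a simplified illustration under stated assumptions, not the ELAND workflow: it uses exact string matching (no mismatches), a C-to-T converted genome for Watson-strand reads and a G-to-A converted genome for Crick-strand reads, and then calls methylation by comparing the unconverted read with the unconverted reference. All function names and the example genome are hypothetical.

```python
def c_to_t(seq):
    return seq.upper().replace("C", "T")


def g_to_a(seq):
    return seq.upper().replace("G", "A")


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text, pattern):
    """All start positions of exact matches, including overlapping ones."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def map_bs_read(read, genome):
    """Return (strand, position) hits in the two converted genomes."""
    hits = [("+", p) for p in find_all(c_to_t(genome), c_to_t(read))]
    crick_view = g_to_a(reverse_complement(read))
    hits += [("-", p) for p in find_all(g_to_a(genome), crick_view)]
    return hits


def call_methylation(read, genome, position):
    """For a Watson-strand hit: True where a reference C is still C in the read."""
    return {position + i: base == "C"
            for i, base in enumerate(read.upper())
            if genome[position + i].upper() == "C"}


genome = "TTGA" + "ACGTTCTCCAGTC" + "GATT"  # hypothetical reference
read = "ACGTTTTTTAGTT"                      # bisulfite-converted read
hits = map_bs_read(read, genome)
calls = call_methylation(read, genome, hits[0][1])
print(hits, calls)
```

A read that hits more than one position across the two converted genomes would need the cross-matching step mentioned on the slide before any methylation call is made.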
24 BSMAP
- Based on a hash-table seeding algorithm
Xi and Li BMC Bioinformatics (2009)
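Hash-table seeding in general works by indexing short substrings of the reference and using them to find candidate positions for each read. The sketch below illustrates that generic idea only; it is not BSMAP's implementation and does not include any bisulfite-aware matching. The seed length, mismatch limit and names are assumptions, and because only the read's first seed is looked up, a mismatch inside that seed would cause the read to be missed.

```python
from collections import defaultdict


def build_seed_index(genome, seed_length):
    """Map every substring of length seed_length to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - seed_length + 1):
        index[genome[i:i + seed_length]].append(i)
    return index


def seed_and_extend(read, genome, index, seed_length, max_mismatches=2):
    """Look up the read's first seed, then verify the full read at each hit."""
    hits = []
    for pos in index.get(read[:seed_length], []):
        candidate = genome[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


genome = "TTGAACGTTCTCCAGTCGATT"  # hypothetical reference
index = build_seed_index(genome, seed_length=4)
hits = seed_and_extend("ACGTTCTGCAG", genome, index, seed_length=4)
print(hits)
```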
25 Re-mapping of Lister's data using BSMAP
Raw Reads   | Method | Uniquely Mapped Reads | Unique and Nonclonal Reads | Unique and Nonclonal Reads (%)
144,704,372 | ELAND  | 55,805,931            | 39,113,599                 | 27.03
144,704,372 | BSMAP  | 67,975,425            | 48,498,687                 | 35.52
Lister et al. Cell (2008)
26 Methylation pattern throughout chromosomes
27 Partially / Unsequenced Genomes
- Options for dealing with partial or unsequenced genomes
- Wait for or generate the genome sequence
- Borrow a reference genome from a phylogenetic neighbour
- Take a deep breath and do de novo
- De novo Genome
- De novo Transcriptome
Diagram: DNA or RNA Sequence Data; Partial Assembly; Partial Sequence Database; Gene Annotation; Genetic Variation; Transcript Variation; Non-coding RNA
28 Plant Genomes: Haploid Size
Human, Arabidopsis, Rice, Potato, Sugarcane, Cotton, Barley, Wheat
Diameter proportional to haploid genome size
29 Plant Genomes: Total Size
Human, Cotton, Barley, Sugarcane
30 De novo RNA Seq
- Why transcriptome?
- Large genome sizes with high repeat content are difficult to assemble
- Transcriptomes are more constant in size
- Enriched for functional content
- Aims
- Transcript discovery
- Small / long non-coding RNA profiling
- Analytical challenges
- Assembly: ABySS, Velvet, Euler-SR
- Comparisons between non-discrete, overlapping transcripts
- Annotation
- Ploidy
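The assemblers named above are generally described as de Bruijn graph assemblers. The toy sketch below shows that idea on error-free reads: k-mers become edges between (k-1)-mers, and a contig is read off by following nodes that have exactly one successor. It is a teaching illustration under those assumptions and omits what real assemblers must handle, such as sequencing errors, reverse complements, coverage and the branching caused by repeats, isoforms or multiple gene copies.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Extend from `start` while the current node has exactly one successor."""
    contig, node, visited = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:
            break  # stop at a cycle rather than loop forever
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Hypothetical overlapping reads from the sequence ACGTTCTCCAGT.
reads = ["ACGTTC", "GTTCTC", "TCTCCA", "TCCAGT"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_contig(graph, "ACG")
print(contig)
```

Where transcripts overlap or share sequence, a node gains more than one successor and the walk stops, which is one way to picture the challenge of comparing non-discrete, overlapping transcripts.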
31 Summary: Impacts and Challenges
- RNASeq
- Increased resolution
- Increased power for transcript complexity and variation
- Analytical challenges: transcript complexity, compositional bias
- Large gains in small and long non-coding RNA profiling
- Epigenomics
- ChIPSeq and MethylSeq
- Genome-wide with resolution
- Robust event calling is challenging
- De novo transcriptomics
- Attractive option for large, repeat-rich genomes
32 Acknowledgements
- CSIRO PI Bioinformatics Team: Andrew Spriggs, Stuart Stephen, Emily Ying, Jose Robles, Michael James
- CSIRO Biostatistics: David Lovell