DNA sequencing & microbial profiling. Multiple sequenc

About This Presentation

Slides: 87

Provided by: academyNe

Transcript and Presenter's Notes

1

NESCent Summer School
Environmental tag sequencing and
metagenomics

Dr Konrad Paszkiewicz Exeter Sequencing Service,
University of Exeter, UK.
2
Environmental sequencing ? Metagenomics?
Wooley JC, Godzik A, Friedberg I (2010). A Primer
on Metagenomics. PLoS Comput Biol 6(2).
3
Overview

What is environmental tag sequencing?
Why?
Methods
Operational Taxonomic Units
Measures of diversity
Other useful visualisations
Tag/marker sequencing
Metagenomics

4
Overview

What is metagenomics?
Why?
Case study 1
Assembly, ORFs and Gene finding
Annotation
Case study 2

5
DNA sequencing & microbial profiling

Traditional microbiology relies on isolation and
culture of bacteria
Cumbersome and labour intensive process
Fails to account for the diversity of microbial
life
Great plate-count anomaly

Staley, J. T., and A. Konopka. 1985. Measurements
of in situ activities of nonphotosynthetic
microorganisms in aquatic and terrestrial
habitats. Annu. Rev. Microbiol. 39: 321-346
6
Why environmental sequencing?

Only a small proportion of organisms have been
grown in culture
Species do not live in isolation
Clonal cultures fail to represent the natural
environment of a given organism
Many proteins and protein functions remain
undiscovered

7
Why environmental sequencing?
Estimated 1000 trillion tons of bacterial/archaeal
life on Earth
Most organisms are difficult to grow in culture
Recent discovery of a completely new fungal
group: Cryptomycota
Discovered by environmental sequencing of
Exeter University's pond water...
Jones, M. D. M. et al. Nature
doi:10.1038/nature09984 (2011).
8
Why environmental sequencing?
Turnbaugh et al. 2006. An obesity-associated gut
microbiome with increased capacity for energy
harvest. Nature 444: 1027-1031
9
Results translate to humans
10x more bacterial cells than human cells
100-fold more unique genes
Ley et al. 2006. Human Gut Microbiomes associated
with obesity. Nature 444: 1022-1023
10
Overview

What is environmental sequencing?
Why?
Methods
Operational Taxonomic Units
Measures of diversity
Other useful visualisations
Tag/marker sequencing
Metagenomics

11
DNA sequencing & microbial profiling
Multiple sequence-based options:
Sequence tag surveys based on single marker genes
Predominantly 16S rRNA for prokaryotes, 18S rRNA
for eukaryotes
Other genes such as rpoB can also be used
Initially done with a cloning step and Sanger
sequencing (can generate sequences that cover the
full length of the gene)
454 pyrosequencing now the most widely used
approach (shorter reads but greater depth)
Illumina can also be used with overlapping
paired-end reads for even shorter reads but 100x
greater depth than 454
First trials with PacBio system (1-20kb but only
50,000 seqs/run)
Metagenomics
Single-cell genomics
12
Sequencing costs
Courtesy Greg Caporaso
13
16S rRNA sequencing

16S rRNA forms part of bacterial ribosomes.
Contains regions of highly conserved and highly
variable sequence.
Variable sequence can be thought of as a
molecular fingerprint. It can be used to identify
bacterial genera and species.
Large public databases available for
comparison. Ribosomal Database Project currently
contains >1.5 million rRNA sequences.
Conserved regions can be targeted to amplify
broad range of bacteria from environmental
samples.
Not quantitative due to copy number variation

Erlandsen S L et al. J Histochem Cytochem
2005; 53: 917-927
Circumvents the need to culture
Alan Walker, Sanger
14
16S sequencing redefined the tree of life
Woese C, Fox G (1977). "Phylogenetic structure of
the prokaryotic domain: the primary kingdoms".
Proc Natl Acad Sci USA 74 (11): 5088-90.
Woese C, Kandler O, Wheelis M (1990). "Towards a
natural system of organisms: proposal for the
domains Archaea, Bacteria, and Eucarya". Proc Natl
Acad Sci USA 87 (12): 4576-9.
15
Which hyper-variable regions to sequence?
E. coli 16S SSU rRNA hyper-variable regions

Region   Position    b.p.
V1       69-99       30
V2       137-242     105
V3       338-533     195
V4       576-682     106
V5       822-879     57
V6       967-1046    79
V7       1117-1173   56
V8       1243-1294   51
V9       1435-1465   30

A detailed analysis of 16S ribosomal RNA gene
segments for the diagnosis of pathogenic bacteria.
J Microbiol Methods. 2007 May; 69(2): 330-339
A quantitative map of nucleotide substitution
rates in bacterial rRNA. Van de Peer et al.
Nucleic Acids Research, 1996, Vol. 24, No. 17,
3381-3391
16
454-based 16S amplicon sequencing
17
Using overlapping paired-end Illumina reads
100bp
100bp
200bp
18
Using overlapping paired-end Illumina reads
100bp
100bp
Merge reads into single high quality 150bp read
150bp
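Merging two 100bp reads into one 150bp read implies a 50bp overlap between the forward read and the reverse complement of its mate. The sketch below illustrates the idea only; the function names, the minimum overlap and the mismatch threshold are assumptions made for this example, and real read mergers also use base quality scores.

```python
import random


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with its reverse mate; return None if no overlap is found.

    The longest acceptable overlap wins. Where the reads disagree inside the
    overlap, the forward read's base is kept (a simplification).
    """
    rev_rc = reverse_complement(rev)
    for olen in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        tail, head = fwd[-olen:], rev_rc[:olen]
        mismatches = sum(1 for x, y in zip(tail, head) if x != y)
        if mismatches <= max_mismatch_frac * olen:
            return fwd + rev_rc[olen:]
    return None


# Toy example: a random 150bp fragment read as 2 x 100bp from both ends.
rng = random.Random(0)
fragment = "".join(rng.choice("ACGT") for _ in range(150))
fwd_read = fragment[:100]
rev_read = reverse_complement(fragment[50:])
merged = merge_pair(fwd_read, rev_read)
```

Reads with no detectable overlap return None and would be handled separately, for example as single-end reads.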
19
Using overlapping paired-end Illumina reads

150bp reads useful for sequencing of individual
variable regions (e.g. V3,V6)
Even single-end reads can be useful
Enables 3-120 million reads per sample, 100x
more than 454

20
Using overlapping paired-end Illumina reads

150bp reads useful for sequencing of individual
variable regions (e.g. V3,V6)
Even single-end reads can be useful
Enables 3-120 million reads per sample

21
Overview

What is environmental sequencing?
Why?
Methods
Operational Taxonomic Units
Measures of diversity
Other useful visualisations
Tag/marker sequencing
Metagenomics

22
How do we define a species?
"No single definition has satisfied all
naturalists; yet every naturalist knows vaguely
what he means when he speaks of a species."
Charles Darwin, On the Origin of Species, 1859
23
How do we define a species for tag data?

Species concept works for sexually reproducing
organisms
Breaks down when applied to bacteria and fungi
Plasmids
Horizontal gene transfer
Transposons/Viruses
Operational Taxonomic Unit (OTU)
An arbitrary definition of a taxonomic unit based
on sequence divergence
OTU definitions matter

24
How do we define a species for tag data?

Search for sequence similarity between 16S/18S
variable regions (e.g. V1-V3) or particular
genes (e.g. rpoB)
These genes are house-keeping genes which are
less likely to be involved in horizontal
transfer
However, note that 16S/18S sequences are known
to have variable copy numbers which can bias
results

DeSantis et al. Greengenes, a chimera-checked
16S rRNA gene database. Appl Environ Microbiol 72:
5069-5072
www.mlst.net
25
Binning tags

Tags may be analysed in one of two ways
Composition-based binning
Relies on comparisons of gross-features to
species/genus/families which share these features
GC content
Di/Tri/Tetra/... nucleotide composition
(kmer-based frequency comparison)
Codon usage statistics
Similarity-based binning
Requires that most sequences in a sample are
present in a reference database
Direct comparison of OTU sequence to a reference
database
Identity cut-off varies depending on resolution
required
Family - 80%
Genus - 90%
Species - 97%
Multiple marker genes used for finer
sub-strain identification (MLST)
Too stringent cut-off selection will lead to
excessive diversity being reported
Sequencing errors
Sample prep issues
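The identity cut-offs for similarity-based binning can be sketched as follows. This is a minimal illustration that assumes the two tags are already aligned and of equal length; the function names are hypothetical, and real tools align each tag against a reference database first.

```python
# Cut-offs taken from the slide, ordered from finest to coarsest rank.
CUTOFFS = [("species", 97.0), ("genus", 90.0), ("family", 80.0)]


def percent_identity(a, b):
    """Percentage of matching positions between two pre-aligned, equal-length tags."""
    if len(a) != len(b) or not a:
        raise ValueError("tags must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def finest_rank(identity):
    """Return the finest taxonomic rank supported by this identity, or None."""
    for rank, cutoff in CUTOFFS:
        if identity >= cutoff:
            return rank
    return None


def mutate(seq, positions):
    """Helper for the example: substitute the base at each given position."""
    bases = list(seq)
    for p in positions:
        bases[p] = "C" if bases[p] == "A" else "A"
    return "".join(bases)


reference = "ACGT" * 25                      # 100bp toy tag
close_tag = mutate(reference, [0, 4])        # 2 substitutions
distant_tag = mutate(reference, range(25))   # 25 substitutions
```

Raising the species cut-off towards 100% would split reads that differ only by sequencing error into separate OTUs, which is the inflated diversity the slide warns about.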

26
Binning tags
Sample 1
Sample 2
27
A word on the importance of clustering algorithms
The clustering algorithm used to determine
distances between OTUs determines the form of the
resulting phylogenetic tree
28
A word on the importance of clustering algorithms
Average neighbour clustering seems to give the
most robust results
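Average neighbour (average linkage) clustering merges the two clusters with the smallest mean pairwise distance and stops once that mean exceeds a cut-off. The sketch below is an illustration only: the distance table, the sequence names and the 0.03 cut-off (97% identity) are assumptions for this example, not values from the talk.

```python
from itertools import combinations


def average_neighbour(names, dist, cutoff):
    """Cluster names by average linkage; dist maps frozenset({a, b}) to a distance."""
    clusters = [[n] for n in names]

    def mean_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters by mean pairwise distance.
        (i, j), best = min(
            (((i, j), mean_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Toy distances: A/B and C/D are near-identical, the two pairs are far apart.
toy_dist = {
    frozenset(("A", "B")): 0.01, frozenset(("C", "D")): 0.02,
    frozenset(("A", "C")): 0.20, frozenset(("A", "D")): 0.20,
    frozenset(("B", "C")): 0.20, frozenset(("B", "D")): 0.20,
}
otus = average_neighbour(["A", "B", "C", "D"], toy_dist, cutoff=0.03)
```

Swapping the mean for min or max in mean_distance would give nearest or furthest neighbour clustering, which is one way to see how the choice of algorithm changes the resulting OTUs.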
29
Software for binning tags

Tags may be analysed in one of two ways
Composition-based binning
TETRA - Maximal-Order Markov Model
PhyloPythia - Support Vector Machine
Seeded Growing Self-Organising Maps (S-GSOM)
TETRA Codon based usage
Similarity-based binning
Requires that most sequences in a sample are
present in a primary or secondary reference
database
QIIME (Today's workshop!)
MEGAN (comparison against Blast NCBI NR)
Mothur
CARMA (comparison against PFAM)
Phymm
ARB (linked with Silva database)

Wooley et al. A Primer on Metagenomics, PLoS
Computational Biology, Feb 2010, Vol 6(2)
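Composition-based binning compares gross sequence features such as GC content and k-mer frequencies. The sketch below computes a tetranucleotide profile and a simple Euclidean distance between profiles. It is not how TETRA or PhyloPythia work internally; the function names and toy sequences are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_profile(seq, k=4):
    """Relative frequency of every possible k-mer over ACGT, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def profile_distance(p, q):
    """Euclidean distance between two k-mer profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy example: an AT-rich read sits closer to an AT-rich reference bin.
read = kmer_profile("AAAT" * 20)
at_rich_bin = kmer_profile("AATA" * 20)
gc_rich_bin = kmer_profile("GGGC" * 20)
```

Short tags give noisy k-mer profiles, so this kind of comparison is more informative on longer reads or contigs.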
30
Sequence databases for 16S similarity-based
binning
31
Sequence databases for 16S similarity-based
binning
32
Sequence databases for 16S similarity-based
binning
33
Sequence databases for 16S similarity-based
binning
34
Overview

What is environmental sequencing?
Why?
Methods
Operational Taxonomic Units
Measures of diversity
Other useful visualisations
Tag/marker sequencing
Metagenomics

35
Measuring diversity of OTUs

Two primary measures for sequence based studies
Alpha diversity
What is there? How much is there?
Diversity within a sample
Beta diversity
How similar are two samples?
Diversity between samples

36
Measuring diversity

Alpha diversity
Diversity within a sample
Simpson's diversity index (also Shannon, Chao
indices)
Gives less weight to rarest species

S is the number of species
N is the total number of organisms
n_i is the number of organisms of species i
Whittaker, R.H. (1972). "Evolution and
measurement of species diversity". Taxon
(International Association for Plant Taxonomy
(IAPT)) 21 (2/3): 213-251
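The formula image for this slide did not survive in the transcript. One common form of Simpson's index that matches the symbols defined above is D = sum over i of n_i(n_i - 1), divided by N(N - 1); the exact variant shown on the slide (D, 1 - D or 1/D) is not known, so treat the sketch below as an assumption. The Shannon index is included for comparison.

```python
import math


def simpson_index(counts):
    """Simpson's D: probability that two organisms drawn without replacement share a species."""
    n_total = sum(counts)
    if n_total < 2:
        raise ValueError("need at least two organisms")
    return sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))


def shannon_index(counts):
    """Shannon's H (natural log) from per-species counts."""
    n_total = sum(counts)
    return -sum((n / n_total) * math.log(n / n_total) for n in counts if n > 0)


even_community = [5, 5]      # two species, equally abundant
single_species = [10]        # one species only
```

In this form a higher D means lower diversity, which is why 1 - D or 1/D is often reported instead.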
37
Measuring diversity

Beta diversity
Diversity between samples
Sørensen's index

S1 is the number of species in sample 1
S2 is the number of species in sample 2
c is the number of species present in both samples
Whittaker, R.H. (1972). "Evolution and
measurement of species diversity". Taxon
(International Association for Plant Taxonomy
(IAPT)) 21 (2/3): 213-251
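The formula image for this slide is also missing from the transcript. The standard Sørensen similarity, written with the symbols defined above, is 2c / (S1 + S2). A minimal sketch, with a hypothetical function name and toy species lists:

```python
def sorensen_index(sample1, sample2):
    """Sørensen similarity 2c / (S1 + S2) from two collections of species names."""
    species1, species2 = set(sample1), set(sample2)
    if not species1 and not species2:
        raise ValueError("both samples are empty")
    shared = len(species1 & species2)          # c
    return 2 * shared / (len(species1) + len(species2))


gut = ["Bacteroides", "Escherichia", "Lactococcus"]
soil = ["Escherichia", "Lactococcus", "Pseudomonas"]
similarity = sorensen_index(gut, soil)
```

The index is 1 for identical species lists and 0 when no species are shared; it ignores abundance and phylogenetic relatedness.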
38
Measuring diversity

Beta diversity
Diversity between samples
UniFrac distance
Percentage observed branch length unique to
either sample

Lozupone and Knight, 2005. UniFrac: a new
phylogenetic method for comparing microbial
communities. Appl Environ Microbiol 71: 8228
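The unweighted UniFrac idea, the fraction of observed branch length that leads to sequences from only one of the two samples, can be sketched as below. As a simplification the tree is given as a flat list of branches, each already annotated with the samples found beneath it; a real implementation derives this from the phylogenetic tree. The data structure and names are assumptions for illustration.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """branches: iterable of (branch_length, set_of_samples_below_the_branch)."""
    unique = shared = 0.0
    for length, samples in branches:
        in_a, in_b = sample_a in samples, sample_b in samples
        if in_a and in_b:
            shared += length       # branch leads to both samples
        elif in_a or in_b:
            unique += length       # branch leads to one sample only
    observed = unique + shared
    if observed == 0:
        raise ValueError("no branch is observed in either sample")
    return unique / observed


toy_tree = [
    (1.0, {"gut"}),            # lineage seen only in the gut sample
    (1.0, {"soil"}),           # lineage seen only in the soil sample
    (2.0, {"gut", "soil"}),    # ancestral branch shared by both
]
distance = unweighted_unifrac(toy_tree, "gut", "soil")
```

A distance of 0 means the two samples occupy exactly the same branches; 1 means they share none.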
39
Overview

What is environmental sequencing?
Why?
Methods
Operational Taxonomic Units
Measures of diversity
Other useful visualisations
Tag/marker sequencing
Metagenomics

40
Other useful data representations

Simple barcharts
What species are present?
Rarefaction curves
How much of a community have we sampled?
Principal Component Analysis (PCA)
What are the most important factors segregating
communities?
Bootstrapping and jack-knifing
How reliable are our measures of diversity?

41
Simple barcharts
Courtesy T.M. Hudson, University of Exeter
42
Simple charts
Courtesy Greg Caporaso, QIIME
43
Rarefaction curves
Have we sampled enough of a community to get a
true representation?
Number of OTUs
Number of sequences
Adapted from Wooley et al. A Primer on
Metagenomics, PLoS Computational Biology, Feb
2010, Vol 6(2)
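A rarefaction curve can be estimated by repeatedly subsampling the reads at increasing depths and counting the OTUs observed. The sketch below assumes each read has already been assigned an OTU label; the function name, replicate count and toy community are hypothetical.

```python
import random


def rarefaction_curve(otu_labels, depths, n_reps=10, seed=0):
    """Return (depth, mean number of OTUs observed) for each subsampling depth."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        observed = [len(set(rng.sample(otu_labels, depth))) for _ in range(n_reps)]
        curve.append((depth, sum(observed) / n_reps))
    return curve


# Toy community: 100 reads from three OTUs of unequal abundance.
reads = ["otu1"] * 50 + ["otu2"] * 30 + ["otu3"] * 20
curve = rarefaction_curve(reads, depths=[1, 10, 100])
```

If the curve is still climbing at the full sequencing depth, more OTUs remain to be found and the community has not been sampled to saturation.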
44
Principal component analysis
Do samples segregate?
45
Jack-knifing

How much uncertainty is there in the clustering
and PCA plots?
Take a subset of your data
Rerun analysis
Repeat 100s of times
Summarize results of 100s
of analyses
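The resampling loop on this slide can be sketched as below: draw a random subset of the reads, recompute a statistic, repeat, then summarise the spread. The subset fraction, the number of repeats and the use of OTU richness as the statistic are assumptions for this example.

```python
import random
import statistics


def jackknife(reads, statistic, fraction=0.75, n_reps=200, seed=0):
    """Recompute `statistic` on random subsets; return (mean, standard deviation)."""
    rng = random.Random(seed)
    size = max(1, int(len(reads) * fraction))
    values = [statistic(rng.sample(reads, size)) for _ in range(n_reps)]
    return statistics.mean(values), statistics.stdev(values)


def richness(sample):
    """Number of distinct OTUs in a sample of OTU labels."""
    return len(set(sample))


# Two equally abundant OTUs: every 75-read subset must contain both.
toy_reads = ["otu_a"] * 50 + ["otu_b"] * 50
mean_richness, sd_richness = jackknife(toy_reads, richness)
```

A large standard deviation relative to the mean suggests the result is sensitive to which reads were sampled. The same loop applies to clustering or PCA by passing a different statistic.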

46
Overview

What is metagenomics?
Why?
Case study 1
Assembly, ORFs and Gene finding
Annotation
Case study 2

47
Why metagenomics?

Tag sequencing can only inform species or strain
level classification
If the species is known and previously sequenced
we can have some understanding of the metabolic
pathways present due to that organism
However, most microbes have not been sequenced
Most have never even been identified
The depth of sequencing offered by 454, SOLiD and
Illumina sequencers makes metagenomics feasible
Lots of sequences
Possible to get a representative sample of all
genes present
Shorter read length -> hard to assemble
With current technology the aim is to produce
gene catalogues rather than whole genomes
Limited to prokaryotes

48
Why metagenomics?

We contain 10x more bacterial cells than human cells
Environments of interest
Human gut
Human skin
Human oral/nasal and urogenital
Chicken gut microbiome
Terrabase project (Soil metagenomics)
Microbial communities in water (Global Ocean
Sampling survey, Venter)
Keyboards
Examine differences between populations
(cross-sectional studies)
Examine changes over time in a single
population (longitudinal study)
Human Microbiome Project
MetaHIT project

49
What pathways are involved?
Ley et al. 2006. Human Gut Microbiomes associated
with obesity. Nature 444: 1022-1023
50
Case study 1: MetaHIT project
The project objectives: association of bacterial
genes with human health and disease
The central objective of our project is to
establish associations between the genes of the
human intestinal microbiota and our health and
disease. We focus on two disorders of increasing
importance in Europe, Inflammatory Bowel Disease
(IBD) and obesity.
http://www.metahit.eu
51
Illumina profiling
In total 0.58 Terabase data
52
The Illumina pipeline for Human intestinal
metagenomics analysis
53
The contig set

SOAPdenovo (de Bruijn graph-based tool)
Filtering 500bp contigs
Removal of redundancy

Total size   Number        N50 size   N90 size   Max. length
10.3 Gb      6.6 million   2.2 kb     0.7 kb     237.6 kb

Estimated assembly error rate: 14/Mb
Comparable to 454 (Newbler): 20/Mb
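N50 and N90 summarise contig length: N50 is the length of the contig at which, going from longest to shortest, the running total first reaches 50% of the assembly size (90% for N90). A minimal sketch with a hypothetical function name and toy contig lengths:

```python
def n_stat(lengths, fraction=0.5):
    """Return the Nxx contig length, e.g. fraction=0.5 for N50, 0.9 for N90."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("no contigs supplied")


toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # total length 54
n50 = n_stat(toy_contigs, 0.5)
n90 = n_stat(toy_contigs, 0.9)
```

Because the statistic depends on which contigs are counted, N50 values are only comparable between assemblies filtered at the same minimum contig length.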

54
Representation of the human gut microbiome in the
contigs
Illumina contigs encompass a great majority of
sequences from this and previous studies
55
The Illumina pipeline for Human intestinal
metagenomics analysis
56
The gene set

Metagene prediction on the contigs
14 million ORFs >100 bp
Removal of redundancy: 95% nucleotide
identity, 90% of the length of the shorter ORF
3.3 million ORFs, 150 times human gene complement
ORFs are identified if present at relative
abundance
7x10^-7; we name them prevalent genes

57
The gene set is almost complete
>85% of prevalent genes of the cohort are present
in the reference set, by the incidence-based
coverage richness estimator (ICE)
58
Human intestinal microbial genes are largely
shared in the cohort
59
Many sequenced bacterial species are shared
Illumina reads mapped on 650 non-redundant
bacterial genomes of a >1000 genome set, at 90%
identity
Genomes detected (unique reads) Coverage
1 10 All individuals 23
0 >90 64 13 >50 84 (75) 41
Mostly Firmicutes & Bacteroidetes
60
Minimal genome functions required by gut
bacteria

Present in most bacteria
Expected to be most frequent in the gut

61
Overall view of the minimal genome metabolic
pathways (1200 functions)
Letunic et al. 2008. iPath: interactive
exploration of biochemical pathways and networks.
TIBS 33 (3): 101-103
62
PCA of 155 most abundant bacterial species in IBD
patients and healthy controls (n=39)
A human gut microbial gene catalogue established
by metagenomic sequencing, Nature 464, 59-65 (4
March 2010)
63
MetaHIT paper
64
MetaHIT summary

8 billion reads
576Gb of sequence data
42% of reads assembled into 6.6 million contigs
N50 contig length of 2.2 kb
81% of genes un-annotated

More reference genomes are needed!
65
Overview

What is metagenomics?
Why?
Case study 1
Assembly, ORFs and Gene finding
Annotation
Case study 2

66
Metagenomic assemblies

Much harder than single-genome assembly
Many identical or nearly identical reads
Reduce size by clustering data first at 100%
identity
Cannot remove near-identical low abundance kmers
to reduce memory requirements
These may be sequencing errors
Or may be sequences from low abundance organisms
Can try to focus on gene regions by identifying
putative open reading frame start sites and start
assembly there
Still very early days. Hardware requirements
large.
Meta-Velvet
Soapdenovo
Euler

Ye Y, Tang H. An ORFome assembly approach to
metagenomics. 2009. J Bioinform Comput Biol 7:
455-471
67
Gene calling metagenomic assemblies

Gene calling
Finding open reading frames (ORFs) is challenging
when assemblies of genes may only be partial
Start and/or stop codon may be missing
Traditional HMM-based methods (e.g. GeneMark)
fail
However, simulations have shown that 85-90% of
genes can be accurately called, although this is
the best-case scenario
Gene families coding for proteins are expected to
be under selective pressure
One method is to select all reading frames from
any ORF identified and use only those which
appear to be under selective pressure
This may miss ORFs under less selective pressure

Mavromatis et al. Use of simulated data sets to
evaluate the fidelity of metagenomic processing
methods. 2007. Nat Methods 4: 495-500
Yooseph et al. Gene identification and
classification in microbial metagenomic sequence
data via incremental clustering. 2008. BMC
Bioinformatics 9: 182
68
Gene calling metagenomic assemblies
Yooseph et al. Gene identification and
classification in microbial metagenomic sequence
data via incremental clustering. 2008. BMC
Bioinformatics 9: 182
69
Overview

What is metagenomics?
Why?
Case study 1
Assembly, ORFs and Gene finding
Annotation
Case study 2

70
Functional annotation
It may make sense just to skip any attempt at
gene calling altogether
Instead simply use 6-frame translations of contigs
Assuming most genes are 300-2000bp long we can
extract only these translations (100-660 aa long)
Use these to search either primary (sequence) or
secondary (motif/HMM) databases
Software:
MG-RAST (mainly 454)
RAMMCAP
Custom pipeline using InterProScan at EBI (contact
Chris Hunter)
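The 6-frame translation step can be sketched as below: translate both strands in all three frames, split on stop codons and keep stretches of 100-660 amino acids. The standard genetic code is assumed, no start codon is required (partial genes are expected), and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate in frame 0; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_segments(contig, min_aa=100, max_aa=660):
    """Return (strand, frame, peptide) for every stop-free stretch within the size range."""
    hits = []
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if min_aa <= len(segment) <= max_aa:
                    hits.append((strand, frame, segment))
    return hits


# Toy contig with a short ORF; size limits lowered so the example returns hits.
toy_contig = "ATG" + "GCC" * 5 + "TAA"
toy_hits = six_frame_segments(toy_contig, min_aa=5, max_aa=10)
```

The resulting peptides would then be searched against sequence or motif/HMM databases; overlapping hits from different frames need to be resolved downstream.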
71
But
Many organisms and genes are still unknown to
science
Therefore homology-based annotation, and even
motif- and HMM-based annotation, will only provide
reliable annotation for those proteins we already
know about
Current methods will still miss known genes
72
Case study 2
The project objectives: identify the minimum
percentage content of a sample required to
positively identify the presence of a particular
bacterial species.
73
Relative genome coverage
Courtesy Karen Moore University of Exeter
74
Metagenomic analysis processing steps 3

Does 0.1 genome coverage represent highly
conserved regions present in many species?
Map the reads from one genome dataset onto other
selected genomes
Establish the level of false-positives observed

75
Cross-reactivity of Illumina reads between species

Species                     M. populi   L. lactis   F. johnsoniae   A. excentricus   B. petrii
Methylobacterium populi     140333
Lactococcus lactis          106         1241
Flavobacterium johnsoniae   107         66          1225
Asticcacaulis excentricus   241         107         85              16321
Bordetella petrii           142         66          79              124              4605
76
Species indicated by supported unique reads
                Initial   Washed   Ciprofloxacin-treated
Unique reads    838       461      581
>500 reads      60        20       41
250-500 reads   34        6        17
77
Spiking DNA
Input 0.005 (species, reads) | Input 0.05 (species, reads) | Input 0.5 (species, reads)
Methylobacterium populi 240126 | Methylobacterium populi 206163 | Escherichia coli str. K-12 substr. 319973
Spirochaeta coccoides 216319 | Spirochaeta coccoides 192403 | Methylobacterium populi 249378
Riemerella anatipestifer 118033 | Riemerella anatipestifer 78158 | Spirochaeta coccoides 186785
Asticcacaulis excentricus ch. 2 39718 | Asticcacaulis excentricus ch. 2 33895 | Riemerella anatipestifer 87672
Asticcacaulis excentricus ch. 1 36069 | Asticcacaulis excentricus ch. 1 30690 | Asticcacaulis excentricus ch. 2 36510
Vibrio furnissii ch. 1 26472 | Mycoplasma gallisepticum str. R(low) 24703 | Asticcacaulis excentricus ch. 1 33006
Bacteroides helcogenes 23770 | Escherichia coli str. K-12 substr. 64498 | Mycoplasma gallisepticum str. R(low) 20937
Mycoplasma gallisepticum str. R(low) 23559 | Pseudomonas stutzeri 21001 | Shigella sonnei 72968
Pseudomonas stutzeri 14762 | Bacteroides helcogenes 17340 | Shigella boydii 71970
Vibrio furnissii ch. 2 13126 | Shigella sonnei 14753 | Shigella flexneri 57622
Escherichia coli str. K-12 substr. 12231 | Shigella boydii 14523 | Shigella dysenteriae 51900
Methylobacterium extorquens 11166 | Shigella flexneri 11593 | Bacteroides helcogenes 17995
Mesorhizobium loti 10384 | Shigella dysenteriae 10472 | Methylobacterium extorquens 11434
Pseudomonas entomophila 6859 | Methylobacterium extorquens 9542 | Escherichia fergusonii 11005
Pseudomonas putida 6830 | Mesorhizobium loti 9455 | Mesorhizobium loti 10492
Bradyrhizobium japonicum 6633 | Pseudomonas putida 6929 | Pseudomonas stutzeri 9902
Methylobacterium chloromethanicum 5877 | Pseudomonas entomophila 6641 | Pseudomonas putida 6831
Mesorhizobium ciceri biovar biserrulae 5750 | Bradyrhizobium japonicum 6158 | Pseudomonas entomophila 6822
Mycoplasma bovis 5082 | Mesorhizobium ciceri biovar biserrulae 5235 | Bradyrhizobium japonicum 6658
Dyadobacter fermentans 4198 | Methylobacterium chloromethanicum 5165 | Methylobacterium chloromethanicum 6082
78
Reads spread throughout the genomes
[Figure panels: Ancestral, Washed, Ciprofloxacin-treated; genomes shown: Flavobacterium, Lactococcus, Methylobacterium]
79
Summary

Illumina sequencing allows identification of
genera present in the consortium
Identification at the species level is possible
but the level of coverage to ensure false
positives are minimised is under constant
evaluation
>500 reads or >250 reads
0.1 genome coverage
Washing reduces the presence of
gammaproteobacteria, firmicutes and viruses in the
consortia
Addition of ciprofloxacin changes the consortium
dynamics and increases the presence of fungi
This method is being developed into a pipeline
that will enable algal-bacterial interactions to
be studied in more detail using only unique
reads.

80
Final, but important points
Regardless of the type of sequencing you are doing
81
Sample preparation

GIGO: Garbage In, Garbage Out.
Long term sample storage can cause selective
loss of some species (e.g. Bacteroidetes)
Does not discriminate between dead/inactive and
live microbes (unless extracting RNA)
If doing 16S sequencing, consider using
degenerate bases and choose your variable
region(s) with care.
Study size: make sure you include biological
replicates
16S rRNA results are not quantitative due to
copy number variation

Courtesy Alan Walker, Wellcome Trust Sanger
82
Sequencing quality control

Sequencing is not error free
Quality filtering is vital to avoid introducing
false diversity
Ensure adaptor sequences are removed
Platform specific errors
Library preparation itself can introduce errors
Direct DNA sequencing e.g. Oxford Nanopore
PCR amplification is not perfect
PCR enzyme can jump from one DNA strand to
another and introduce chimeric sequences
Ensure computational methods are used to identify
and remove these
PyroNoise 454 chimeric sequence removal

83
Summary
Courtesy Alan Walker, Wellcome Trust Sanger
84
After lunch!
85
QIIME: Quantitative Insights Into Microbial
Ecology
86
Questions?
Konrad Paszkiewicz k.h.paszkiewicz_at_exeter.ac.uk
