DNA sequencing & microbial profiling. Multiple sequenc - PowerPoint PPT Presentation

1
  • NESCent Summer School
  • Environmental tag sequencing and
  • metagenomics

Dr Konrad Paszkiewicz Exeter Sequencing Service,
University of Exeter, UK.
2
Environmental sequencing? Metagenomics?
Wooley JC, Godzik A, Friedberg I (2010). A Primer
on Metagenomics. PLoS Comput Biol 6(2)
3
Overview
  • What is environmental tag sequencing?
  • Why?
  • Methods
  • Operational Taxonomic Units
  • Measures of diversity
  • Other useful visualisations
  • Tag/marker sequencing
  • Metagenomics

4
Overview
  • What is metagenomics?
  • Why?
  • Case study 1
  • Assembly, ORFs and Gene finding
  • Annotation
  • Case study 2

5
DNA sequencing & microbial profiling
  • Traditional microbiology relies on isolation and
    culture of bacteria
  • Cumbersome and labour intensive process
  • Fails to account for the diversity of microbial
    life
  • Great plate-count anomaly

Staley, J. T., and A. Konopka. 1985. Measurements
of in situ activities of nonphotosynthetic
microorganisms in aquatic and terrestrial
habitats. Annu. Rev. Microbiol. 39:321-346
6
Why environmental sequencing?
  • Only a small proportion of organisms have been
    grown in culture
  • Species do not live in isolation
  • Clonal cultures fail to represent the natural
    environment of a given organism
  • Many proteins and protein functions remain
    undiscovered

7
Why environmental sequencing?
  • Estimated 1000 trillion tons of bacterial/archaeal
    life on Earth
  • Most organisms are difficult to grow in culture
  • Recent discovery of a completely new fungal group,
    Cryptomycota
  • Discovered by environmental sequencing of
    Exeter University's pond water...

Jones, M. D. M. et al. Nature doi:10.1038/nature09984
(2011).
8
Why environmental sequencing?
Turnbaugh et al. 2006. An obesity-associated gut
microbiome with increased capacity for energy
harvest. Nature 444:1027-1031
9
Results translate to humans
  • 10x more bacterial cells than human
  • 100-fold more unique genes

Ley et al. 2006. Human Gut Microbiomes associated
with obesity. Nature 444:1022-1023
10
Overview
  • What is environmental sequencing?
  • Why?
  • Methods
  • Operational Taxonomic Units
  • Measures of diversity
  • Other useful visualisations
  • Tag/marker sequencing
  • Metagenomics

11
DNA sequencing & microbial profiling
Multiple sequence-based options:
  • Sequence tag surveys based on single marker genes
  • Predominantly 16S rRNA for prokaryotes, 18S rRNA
    for eukaryotes
  • Other genes such as rpoB can also be used
  • Initially done with a cloning step and Sanger
    sequencing (can generate sequences that cover
    the full length of the gene)
  • 454 pyrosequencing now the most widely used
    approach (shorter reads but greater depth)
  • Illumina can also be used with overlapping
    paired-end reads, for even shorter reads but
    100x greater depth than 454
  • First trials with PacBio system (1-20kb but only
    50,000 seqs/run)
  • Metagenomics
  • Single-cell genomics
12
Sequencing costs
Courtesy Greg Caporaso
13
16S rRNA sequencing
  • 16S rRNA forms part of bacterial ribosomes.
  • Contains regions of highly conserved and highly
    variable sequence.
  • Variable sequence can be thought of as a
    molecular fingerprint. It can be used to identify
    bacterial genera and species.
  • Large public databases available for
    comparison. Ribosomal Database Project currently
    contains >1.5 million rRNA sequences.
  • Conserved regions can be targeted to amplify a
    broad range of bacteria from environmental
    samples.
  • Not quantitative due to copy number variation

Erlandsen S L et al. J Histochem Cytochem
2005;53:917-927
Circumvents the need to culture
Alan Walker, Sanger
14
16S sequencing redefined the tree of life
Woese C, Fox G (1977). "Phylogenetic structure of
the prokaryotic domain: the primary kingdoms".
Proc Natl Acad Sci USA 74 (11): 5088-90.
Woese C, Kandler O, Wheelis M (1990). "Towards a
natural system of organisms: proposal for the
domains Archaea, Bacteria, and Eucarya". Proc Natl
Acad Sci USA 87 (12): 4576-9
15
Which hyper-variable regions to sequence?
E. coli 16S SSU rRNA hyper-variable regions

Region   Position     Length (bp)
V1       69-99        30
V2       137-242      105
V3       338-533      195
V4       576-682      106
V5       822-879      57
V6       967-1046     79
V7       1117-1173    56
V8       1243-1294    51
V9       1435-1465    30

A detailed analysis of 16S ribosomal RNA gene
segments for the diagnosis of pathogenic bacteria.
J Microbiol Methods. 2007 May; 69(2): 330-339
A quantitative map of nucleotide substitution
rates in bacterial rRNA. Van de Peer et al.
Nucleic Acids Research, 1996, Vol. 24, No. 17,
3381-3391
16
454-based 16S amplicon sequencing
17
Using overlapping paired-end Illumina reads
100bp
100bp
200bp
18
Using overlapping paired-end Illumina reads
100bp
100bp
Merge reads into single high quality 150bp read
150bp
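
The merging step pictured on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular pipeline: the function names, the minimum overlap and the mismatch tolerance are all assumptions, and base quality scores (which real mergers use to resolve disagreements in the overlap) are ignored.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with its reverse mate.

    The reverse read is reverse-complemented, then the longest suffix of
    the forward read that matches a prefix of it (within the allowed
    mismatch fraction) is taken as the overlap. Returns the merged
    sequence, or None if no acceptable overlap exists. In the overlap
    the forward read's bases are kept (an arbitrary simplification).
    """
    rev_rc = reverse_complement(rev)
    for overlap in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        tail = fwd[-overlap:]
        head = rev_rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            return fwd + rev_rc[overlap:]
    return None


# Simulate the slide: a 150bp fragment read as 2 x 100bp from both ends.
random.seed(1)
fragment = "".join(random.choice("ACGT") for _ in range(150))
fwd_read = fragment[:100]
rev_read = reverse_complement(fragment[50:])
merged = merge_pair(fwd_read, rev_read)
print(len(merged))
```

Trying the longest overlap first means a fully overlapping pair is preferred over a short chance match at the read ends.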
19
Using overlapping paired-end Illumina reads
  • 150bp reads useful for sequencing of individual
    variable regions (e.g. V3,V6)
  • Even single-end reads can be useful
  • Enables 3-120 million reads per sample, 100x
    more than 454

20
Using overlapping paired-end Illumina reads
  • 150bp reads useful for sequencing of individual
    variable regions (e.g. V3,V6)
  • Even single-end reads can be useful
  • Enables 3-120 million reads per sample

21
Overview
  • What is environmental sequencing?
  • Why?
  • Methods
  • Operational Taxonomic Units
  • Measures of diversity
  • Other useful visualisations
  • Tag/marker sequencing
  • Metagenomics

22
How do we define a species?
"No single definition has satisfied all
naturalists; yet every naturalist knows vaguely
what he means when he speaks of a species"
Charles Darwin, On the Origin of Species, 1859
23
How do we define a species for tag data?
  • Species concept works for sexually reproducing
    organisms
  • Breaks down when applied to bacteria and fungi
  • Plasmids
  • Horizontal gene transfer
  • Transposons/Viruses
  • Operational Taxonomic Unit (OTU)
  • An arbitrary definition of a taxonomic unit based
    on sequence divergence
  • OTU definitions matter

24
How do we define a species for tag data?
  • Search for sequence similarity between 16S/18S
    variable regions (e.g. V1-V3) or particular
    genes (e.g. rpoB)
  • These genes are house-keeping genes which are
    less likely to be involved in horizontal
    transfer
  • However, note that 16S/18S sequences are known
    to have variable copy numbers which can bias
    results

DeSantis et al. Greengenes, a chimera-checked
16S rRNA gene database. Appl Environ Microbiol
72:5069-5072
www.mlst.net
25
Binning tags
  • Tags may be analysed in one of two ways
  • Composition-based binning
  • Relies on comparisons of gross-features to
    species/genus/families which share these features
  • GC content
  • Di/Tri/Tetra/... nucleotide composition
    (kmer-based frequency comparison)
  • Codon usage statistics
  • Similarity-based binning
  • Requires that most sequences in a sample are
    present in a reference database
  • Direct comparison of OTU sequence to a reference
    database
  • Identity cut-off varies depending on resolution
    required
  • Genus - 90%
  • Family - 80%
  • Species - 97%
  • Multiple marker genes used for finer
    sub-strain identification (MLST)
  • Too stringent cut-off selection will lead to
    excessive diversity being reported
  • Sequencing errors
  • Sample prep issues
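
Both binning strategies above can be illustrated with a small Python sketch. The composition features (GC content, k-mer frequencies) and the identity cut-offs come from the slide; the function names are hypothetical, and the identity calculation assumes the two sequences have already been aligned to equal length, which a real tool would do for you.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_profile(seq, k=4):
    """Relative frequency of every A/C/G/T k-mer (tetranucleotides by default)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def percent_identity(a, b):
    """Percent identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Identity cut-offs as listed on the slide, finest rank first.
RANK_CUTOFFS = [("species", 97.0), ("genus", 90.0), ("family", 80.0)]


def finest_rank(identity):
    """Finest taxonomic rank supported by a given percent identity."""
    for rank, cutoff in RANK_CUTOFFS:
        if identity >= cutoff:
            return rank
    return None


print(gc_content("GGCCAATT"))
print(finest_rank(percent_identity("ACGTACGTAC", "ACGTACGTAA")))
```

Composition profiles would then be compared between a tag and reference taxa, whereas the identity route needs the reference sequence itself to be in the database.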

26
Binning tags
Sample 1
Sample 2
27
A word on the importance of clustering algorithms
The clustering algorithm used to determine
distances between OTUs determines the form of the
resulting phylogenetic tree
28
A word on the importance of clustering algorithms
Average neighbour clustering seems to give the
most robust results
29
Software for binning tags
  • Tags may be analysed in one of two ways
  • Composition-based binning
  • TETRA - Maximal-Order Markov Model
  • PhyloPythia - Support Vector Machine
  • Seeded Growing Self-Organising Maps (S-GSOM)
  • TETRA Codon based usage
  • Similarity-based binning
  • Requires that most sequences in a sample are
    present in a primary or secondary reference
    database
  • QIIME (Today's workshop!)
  • MEGAN (comparison against Blast NCBI NR)
  • Mothur
  • CARMA (comparison against PFAM)
  • Phymm
  • ARB (linked with Silva database)

Wooley et al. A Primer on Metagenomics, PLoS
Computational Biology, Feb 2010, Vol 6(2)
30
Sequence databases for 16S similarity-based
binning
31
Sequence databases for 16S similarity-based
binning
32
Sequence databases for 16S similarity-based
binning
33
Sequence databases for 16S similarity-based
binning
34
Overview
  • What is environmental sequencing?
  • Why?
  • Methods
  • Operational Taxonomic Units
  • Measures of diversity
  • Other useful visualisations
  • Tag/marker sequencing
  • Metagenomics

35
Measuring diversity of OTUs
  • Two primary measures for sequence based studies
  • Alpha diversity
  • What is there? How much is there?
  • Diversity within a sample
  • Beta diversity
  • How similar are two samples?
  • Diversity between samples

36
Measuring diversity
  • Alpha diversity
  • Diversity within a sample
  • Simpson's diversity index (also Shannon, Chao
    indexes)
  • Gives less weight to rarest species

S is the number of species
N is the total number of organisms
n_i is the number of organisms of species i

Whittaker, R.H. (1972). "Evolution and
measurement of species diversity". Taxon
(International Association for Plant Taxonomy
(IAPT)) 21 (2/3): 213-251
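
The formula itself did not survive in this transcript, so the sketch below uses one common form of Simpson's index, D = sum of n_i(n_i - 1) divided by N(N - 1), with the symbols defined above. Whether the slide showed this form or the sum of squared proportions is an assumption. Shannon's index is included for comparison; the function names are illustrative.

```python
import math


def simpson_index(counts):
    """Simpson's D: probability that two organisms drawn without
    replacement belong to the same species."""
    N = sum(counts)
    if N < 2:
        raise ValueError("need at least two organisms")
    return sum(n * (n - 1) for n in counts) / (N * (N - 1))


def simpson_diversity(counts):
    """1 - D, so that larger values mean a more diverse sample."""
    return 1.0 - simpson_index(counts)


def shannon_index(counts):
    """Shannon's H' using natural logarithms."""
    N = sum(counts)
    return -sum((n / N) * math.log(n / N) for n in counts if n > 0)


# One entry per species (S = 4 here): the number of organisms n_i.
otu_counts = [50, 30, 15, 5]
print(simpson_diversity(otu_counts))
print(shannon_index(otu_counts))
```

Because each term depends on n_i squared, species with very few members contribute almost nothing to D, which is why the index gives less weight to the rarest species.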
37
Measuring diversity
  • Beta diversity
  • Diversity between samples
  • Sørensen's index

S_1 is the number of species in sample 1
S_2 is the number of species in sample 2
c is the number of species present in both samples

Whittaker, R.H. (1972). "Evolution and
measurement of species diversity". Taxon
(International Association for Plant Taxonomy
(IAPT)) 21 (2/3): 213-251
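
With the symbols defined above, the index is usually written as 2c / (S1 + S2); the formula image is missing from this transcript, so that standard form is assumed here. A minimal Python version, with a hypothetical function name:

```python
def sorensen_index(sample1, sample2):
    """Sørensen similarity between two samples given as collections of
    species (or OTU) names: 2c / (S1 + S2)."""
    s1, s2 = set(sample1), set(sample2)
    if not s1 and not s2:
        raise ValueError("both samples are empty")
    c = len(s1 & s2)  # species present in both samples
    return 2 * c / (len(s1) + len(s2))


pond = {"otu1", "otu2", "otu3"}
soil = {"otu2", "otu3", "otu4"}
print(sorensen_index(pond, soil))
```

The value runs from 0 (no shared species) to 1 (identical species lists); it uses presence or absence only, not abundance.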
38
Measuring diversity
  • Beta diversity
  • Diversity between samples
  • UniFrac distance
  • Percentage observed branch length unique to
    either sample

Lozupone and Knight, 2005. UniFrac: a new
phylogenetic method for comparing microbial
communities. Appl Environ Microbiol 71:8228
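
The "branch length unique to either sample" idea can be sketched as follows. To keep the example short the tree is given as a flat list of branches, each with its length and the set of tips beneath it; that representation, the function name and the toy tree are assumptions made for illustration, and only the unweighted variant is shown.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Fraction of observed branch length leading to tips from only one sample.

    branches: iterable of (length, set_of_descendant_tips)
    sample_a, sample_b: collections of tip (OTU) names seen in each sample
    """
    a, b = set(sample_a), set(sample_b)
    unique = 0.0
    observed = 0.0
    for length, tips in branches:
        in_a = bool(tips & a)
        in_b = bool(tips & b)
        if in_a or in_b:
            observed += length
            if in_a != in_b:  # branch leads to one sample only
                unique += length
    if observed == 0:
        raise ValueError("no branch leads to either sample")
    return unique / observed


# Toy tree ((otu1:1,otu2:1):1,(otu3:1,otu4:1):1), listed branch by branch.
tree = [
    (1.0, {"otu1"}),
    (1.0, {"otu2"}),
    (1.0, {"otu1", "otu2"}),
    (1.0, {"otu3"}),
    (1.0, {"otu4"}),
    (1.0, {"otu3", "otu4"}),
]
print(unweighted_unifrac(tree, {"otu1", "otu2"}, {"otu2", "otu3"}))
```

Unlike Sørensen's index, two samples with no OTUs in common can still score below 1 if their OTUs are close relatives sharing internal branches.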
39
Overview
  • What is environmental sequencing?
  • Why?
  • Methods
  • Operational Taxonomic Units
  • Measures of diversity
  • Other useful visualisations
  • Tag/marker sequencing
  • Metagenomics

40
Other useful data representations
  • Simple barcharts
  • What species are present?
  • Rarefaction curves
  • How much of a community have we sampled?
  • Principal Component Analysis (PCA)
  • What are the most important factors segregating
    communities?
  • Bootstrapping and jack-knifing
  • How reliable are our measures of diversity?

41
Simple barcharts
Courtesy T.M. Hudson, University of Exeter
42
Simple charts
Courtesy Greg Caporaso, QIIME
43
Rarefaction curves
Have we sampled enough of a community to get a
true representation?
Number of OTUs
Number of sequences
Adapted from Wooley et al. A Primer on
Metagenomics, PLoS Computational Biology, Feb
2010, Vol 6(2)
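
A rarefaction curve of this kind can be computed by repeatedly subsampling the reads at increasing depths and counting the OTUs seen. The sketch below assumes each read has already been assigned an OTU label; the function name, the number of iterations and the toy community are illustrative choices.

```python
import random


def rarefaction_curve(otu_labels, depths, n_iter=100, seed=0):
    """Mean number of distinct OTUs observed at each subsampling depth.

    otu_labels: list with one OTU label per sequence
    depths: numbers of sequences to subsample (each <= len(otu_labels))
    """
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        if depth > len(otu_labels):
            raise ValueError("depth exceeds number of sequences")
        total = 0
        for _ in range(n_iter):
            total += len(set(rng.sample(otu_labels, depth)))
        curve.append((depth, total / n_iter))
    return curve


# Toy community: 100 sequences spread unevenly over three OTUs.
reads = ["otu1"] * 50 + ["otu2"] * 30 + ["otu3"] * 20
for depth, mean_otus in rarefaction_curve(reads, [1, 5, 10, 50, 100]):
    print(depth, mean_otus)
```

If the curve is still climbing at the full sequencing depth, more sequencing would be expected to reveal further OTUs.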
44
Principal component analysis
Do samples segregate?
45
Jack-knifing
  • How much uncertainty is there in the clustering
    and PCA plots?
  • Take a subset of your data
  • Rerun analysis
  • Repeat 100s of times
  • Summarize results of 100s of analyses
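
The loop described above can be written generically in Python. In this sketch the "analysis" is any function of a list of reads; the function name, the subset size and the use of observed OTU count as the statistic are assumptions for illustration.

```python
import random
import statistics


def jackknife(reads, statistic, subset_size, n_iter=100, seed=0):
    """Recompute a statistic on many random subsets of the data.

    Returns the mean and standard deviation of the statistic across
    the n_iter subsamples (n_iter must be at least 2).
    """
    rng = random.Random(seed)
    values = [statistic(rng.sample(reads, subset_size)) for _ in range(n_iter)]
    return statistics.mean(values), statistics.stdev(values)


def observed_otus(subset):
    """Example statistic: number of distinct OTU labels in the subset."""
    return len(set(subset))


reads = ["otu1"] * 60 + ["otu2"] * 30 + ["otu3"] * 9 + ["otu4"] * 1
mean_otus, sd_otus = jackknife(reads, observed_otus, subset_size=50, n_iter=200)
print(mean_otus, sd_otus)
```

A large spread across subsamples suggests the result depends heavily on which reads happened to be sequenced.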

46
Overview
  • What is metagenomics?
  • Why?
  • Case study 1
  • Assembly, ORFs and Gene finding
  • Annotation
  • Case study 2

47
Why metagenomics?
  • Tag sequencing can only inform species or strain
    level classification
  • If the species is known and previously sequenced
    we can have some understanding of the metabolic
    pathways present due to that organism
  • However, most microbes have not been sequenced
  • Most have never even been identified
  • The depth of sequencing offered by 454, SOLiD and
    Illumina sequencers makes metagenomics feasible
  • Lots of sequences
  • Possible to get a representative sample of all
    genes present
  • Shorter read length -> hard to assemble
  • With current technology the aim is to produce
    gene catalogues rather than whole genomes
  • Limited to prokaryotes

48
Why metagenomics?
  • We contain 10x more bacterial cells than human
  • Environments of interest
  • Human gut
  • Human skin
  • Human oral/nasal and urogenital
  • Chicken gut microbiome
  • Terrabase project (Soil metagenomics)
  • Microbial communities in water (Global Ocean
    Sampling survey Venter)
  • Keyboards
  • Examine differences between populations
    (cross-sectional studies)
  • Examine changes over time in a single
    population (longitudinal study)
  • Human Microbiome Project
  • MetaHIT project

49
What pathways are involved?
Ley et al. 2006 Human Gut Microbiomes associated
with obesity. Nature 444 1022-1023
50
Case study 1: MetaHIT project
The project objectives: association of bacterial
genes with human health and disease
"The central objective of our project is to
establish associations between the genes of the
human intestinal microbiota and our health and
disease. We focus on two disorders of increasing
importance in Europe, Inflammatory Bowel Disease
(IBD) and obesity."
http://www.metahit.eu
51
Illumina profiling
In total: 0.58 terabases of data
52
The Illumina pipeline for Human intestinal
metagenomics analysis
53
The contig set
  • SOAPdenovo (de Bruijn graph-based tool)
  • Filtering 500bp contigs
  • Removal of redundancy

Total Size   Number        N50 Size   N90 Size   Max. Length
10.3 Gb      6.6 Million   2.2 kb     0.7 kb     237.6 kb
  • Estimated assembly error rate 14/Mb
  • Comparable to 454 (Newbler) 20/Mb
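
For readers unfamiliar with the N50 and N90 columns: N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases, and N90 is the same with 90%. A small Python sketch (the function name and the toy lengths are illustrative, not MetaHIT data):

```python
def nx(lengths, fraction=0.5):
    """Return the Nx contig length (N50 when fraction=0.5, N90 when 0.9)."""
    if not lengths:
        raise ValueError("no contigs given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(nx(contig_lengths, 0.5))  # N50
print(nx(contig_lengths, 0.9))  # N90
```

N90 can never exceed N50, which matches the 0.7 kb versus 2.2 kb reported in the table.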

54
Representation of the human gut microbiome in the
contigs
Illumina contigs encompass a great majority of
sequences from this and previous studies
55
The Illumina pipeline for Human intestinal
metagenomics analysis
56
The gene set
  • Metagene prediction on the contigs
  • 14 million ORFs >100 bp
  • Removal of redundancy: 95% nucleotide
    identity, 90% of the length of the shorter ORF
  • 3.3 million ORFs, 150 times human gene complement
  • ORFs are identified if present at relative
    abundance 7x10^-7; we name them prevalent genes

57
The gene set is almost complete
>85% of prevalent genes of the cohort are present
in the reference set, by the incidence-based
coverage richness estimator (ICE)
58
Human intestinal microbial genes are largely
shared in the cohort
59
Many sequenced bacterial species are shared
Illumina reads mapped on 650 non-redundant
bacterial genomes of a >1000 genome set, at 90%
identity
Genomes detected (unique reads) Coverage
1 10 All individuals 23
0 >90 64 13 >50 84 (75) 41
Mostly Firmicutes and Bacteroidetes
60
Minimal genome functions required by gut
bacteria
  • Present in most bacteria
  • Expected to be most frequent in the gut

61
Overall view of the minimal genome metabolic
pathways (1200 functions)
Letunic et al. 2008. iPath: interactive
exploration of biochemical pathways and networks.
TIBS 33 (3): 101-103
62
PCA of 155 most abundant bacterial species in IBD
patients and healthy controls (n=39)
A human gut microbial gene catalogue established
by metagenomic sequencing, Nature 464, 59-65 (4
March 2010)
63
MetaHIT paper
64
MetaHIT summary
  • 8 billion reads
  • 576Gb of sequence data
  • 42% of reads assembled into 6.6 million contigs
  • N50 contig length of 2.2 kb
  • 81% of genes un-annotated

More reference genomes are needed!
65
Overview
  • What is metagenomics?
  • Why?
  • Case study 1
  • Assembly, ORFs and Gene finding
  • Annotation
  • Case study 2

66
Metagenomic assemblies
  • Much harder than single-genome assembly
  • Many identical or nearly identical reads
  • Reduce size by clustering data first at 100%
    identity
  • Cannot remove near-identical low abundance kmers
    to reduce memory requirements
  • These may be sequencing errors
  • Or may be sequences from low abundance organisms
  • Can try to focus on gene regions by identifying
    putative open reading frame start sites and start
    assembly there
  • Still very early days. Hardware requirements
    large.
  • MetaVelvet
  • SOAPdenovo
  • Euler

Ye Y, Tang H. An ORFome assembly approach to
metagenomics. 2009. J Bioinform Comput Biol
7:455-471
67
Gene calling metagenomic assemblies
  • Gene calling
  • Finding open reading frames (ORFs) is challenging
    when assemblies of genes may only be partial
  • Start and/or stop codons may be missing
  • Traditional HMM-based methods (e.g. GeneMark)
    fail
  • However, simulations have shown that 85-90% of
    genes can be accurately called, although this is
    a best-case scenario
  • Gene families coding for proteins are expected to
    be under selective pressure
  • One method is to select all reading frames from
    any ORF identified and use only those which
    appear to be under selective pressure
  • This may miss ORFs under less selective pressure

Mavromatis et al. Use of simulated data sets to
evaluate the fidelity of metagenomic processing
methods. 2007. Nat Methods 4:495-500
Yooseph et al. Gene identification and
classification in microbial metagenomic sequence
data via incremental clustering. 2008. BMC
Bioinformatics 9:182
68
Gene calling metagenomic assemblies
Yooseph et al. Gene identification and
classification in microbial metagenomic sequence
data via incremental clustering. 2008. BMC
Bioinformatics 9:182
69
Overview
  • What is metagenomics?
  • Why?
  • Case study 1
  • Assembly, ORFs and Gene finding
  • Annotation
  • Case study 2

70
Functional annotation
  • It may make sense just to skip any attempt at
    gene calling altogether
  • Instead simply use 6-frame translations of contigs
  • Assuming most genes are 300-2000bp long, we can
    extract only these translations (100-660 aa long)
  • Use these to search either primary (sequence) or
    secondary (motif/HMM) databases
  • Software
  • MG-RAST (mainly 454)
  • RAMMCAP
  • Custom pipeline using InterProScan at EBI
    (contact Chris Hunter)
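
The 6-frame translation step can be sketched in plain Python. The codon table below is the standard genetic code; treating every stop-free stretch of 100-660 residues as a candidate is one reading of the slide, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(contig):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    contig = contig.upper()
    frames = {}
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def candidate_peptides(contig, min_aa=100, max_aa=660):
    """Stop-free stretches within the length window, from all six frames."""
    peptides = []
    for frame, protein in six_frame_translations(contig).items():
        for segment in protein.split("*"):
            if min_aa <= len(segment) <= max_aa:
                peptides.append((frame, segment))
    return peptides


print(six_frame_translations("ATGGCCTAA"))
```

The resulting peptides would then be passed to the sequence or motif/HMM searches listed above; no start codon is required, which suits partial genes on short contigs.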
71
But
  • Many organisms and genes are still unknown to
    science
  • Therefore homology-based annotation, and even
    motif- and HMM-based annotation, will only
    provide reliable annotation for those proteins
    we already know about
  • Current methods will still miss known genes
72
Case study 2
The project objectives: identify the minimum
percentage content of a sample required to
positively identify the presence of a particular
bacterial species.
73
Relative genome coverage
Courtesy Karen Moore University of Exeter
74
Metagenomic analysis processing steps 3
  • Does 0.1 genome coverage represent highly
    conserved regions present in many species?
  • Map the reads from one genome dataset onto other
    selected genomes
  • Establish the level of false-positives observed

75
Cross-reactivity of Illumina reads between species
Species                          (1)      (2)    (3)    (4)     (5)
(1) Methylobacterium populi      140333
(2) Lactococcus lactis           106      1241
(3) Flavobacterium johnsoniae    107      66     1225
(4) Asticcacaulis excentricus    241      107    85     16321
(5) Bordetella petrii            142      66     79     124     4605
76
Species indicated by supported unique reads
                Initial   Washed   Ciprofloxacin-treated
Unique reads    838       461      581
>500 reads      60        20       41
250-500 reads   34        6        17
77
Spiking DNA
Input: 0.005
Methylobacterium populi                  240126
Spirochaeta coccoides                    216319
Riemerella anatipestifer                 118033
Asticcacaulis excentricus ch. 2          39718
Asticcacaulis excentricus ch. 1          36069
Vibrio furnissii ch. 1                   26472
Bacteroides helcogenes                   23770
Mycoplasma gallisepticum str. R(low)     23559
Pseudomonas stutzeri                     14762
Vibrio furnissii ch. 2                   13126
Escherichia coli str. K-12 substr.       12231
Methylobacterium extorquens              11166
Mesorhizobium loti                       10384
Pseudomonas entomophila                  6859
Pseudomonas putida                       6830
Bradyrhizobium japonicum                 6633
Methylobacterium chloromethanicum        5877
Mesorhizobium ciceri biovar biserrulae   5750
Mycoplasma bovis                         5082
Dyadobacter fermentans                   4198

Input: 0.05
Methylobacterium populi                  206163
Spirochaeta coccoides                    192403
Riemerella anatipestifer                 78158
Asticcacaulis excentricus ch. 2          33895
Asticcacaulis excentricus ch. 1          30690
Mycoplasma gallisepticum str. R(low)     24703
Escherichia coli str. K-12 substr.       64498
Pseudomonas stutzeri                     21001
Bacteroides helcogenes                   17340
Shigella sonnei                          14753
Shigella boydii                          14523
Shigella flexneri                        11593
Shigella dysenteriae                     10472
Methylobacterium extorquens              9542
Mesorhizobium loti                       9455
Pseudomonas putida                       6929
Pseudomonas entomophila                  6641
Bradyrhizobium japonicum                 6158
Mesorhizobium ciceri biovar biserrulae   5235
Methylobacterium chloromethanicum        5165

Input: 0.5
Escherichia coli str. K-12 substr.       319973
Methylobacterium populi                  249378
Spirochaeta coccoides                    186785
Riemerella anatipestifer                 87672
Asticcacaulis excentricus ch. 2          36510
Asticcacaulis excentricus ch. 1          33006
Mycoplasma gallisepticum str. R(low)     20937
Shigella sonnei                          72968
Shigella boydii                          71970
Shigella flexneri                        57622
Shigella dysenteriae                     51900
Bacteroides helcogenes                   17995
Methylobacterium extorquens              11434
Escherichia fergusonii                   11005
Mesorhizobium loti                       10492
Pseudomonas stutzeri                     9902
Pseudomonas putida                       6831
Pseudomonas entomophila                  6822
Bradyrhizobium japonicum                 6658
Methylobacterium chloromethanicum        6082
78
Reads spread throughout the genomes
Ancestral
Washed

Ciprofloxacin-treated
Flavobacterium
Lactococcus
Methylobacterium
79
Summary
  • Illumina sequencing allows identification of
    genera present in the consortium
  • Identification at the species level is possible
    but the level of coverage to ensure false
    positives are minimised is under constant
    evaluation
  • >500 reads or >250 reads
  • 0.1 genome coverage
  • Washing reduces the presence of
    Gammaproteobacteria, Firmicutes and viruses in
    the consortia
  • Addition of ciprofloxacin changes the consortium
    dynamics and increases the presence of fungi
  • This method is being developed into a pipeline
    that will enable algal bacterial interactions to
    be studied in more detail using only unique
    reads.

80
Final, but important points
Regardless of the type of sequencing you are doing
81
Sample preparation
  • GIGO: Garbage In, Garbage Out.
  • Long term sample storage can cause selective
    loss of some species (e.g. Bacteroidetes)
  • Does not discriminate between dead/inactive and
    live microbes (unless extracting RNA)
  • If doing 16S sequencing, consider using
    degenerate bases and choose your variable
    region(s) with care.
  • Study size: make sure you include biological
    replicates
  • 16S rRNA results are not quantitative due to
    copy number variation

Courtesy Alan Walker, Wellcome Trust Sanger
82
Sequencing quality control
  • Sequencing is not error free
  • Quality filtering is vital to avoid introducing
    false diversity
  • Ensure adaptor sequences are removed
  • Platform specific errors
  • Library preparation itself can introduce errors
  • Direct DNA sequencing e.g. Oxford Nanopore
  • PCR amplification is not perfect
  • PCR enzyme can jump from one DNA strand to
    another and introduce chimeric sequences
  • Ensure computational methods are used to identify
    and remove these
  • PyroNoise: 454 chimeric sequence removal

83
Summary
Courtesy Alan Walker, Wellcome Trust Sanger
84
After lunch!
85
QIIME: Quantitative Insights Into Microbial
Ecology
86
Questions?
Konrad Paszkiewicz k.h.paszkiewicz_at_exeter.ac.uk