Title: Freek T. Bakker
1Optimising DNA barcode regions
- Freek T. Bakker
- Nationaal Herbarium Nederland, Wageningen University branch
- Biosystematics Group, Wageningen UR
- The Netherlands
2Structure of this talk
- DNA barcoding, CBOL, GenBank
- Non-COI protocol
- Models in DNA barcode matching
3DNA barcoding
Using molecular data as species diagnostics
isn't new, but global standardization and scale
of implementation are
4http://www.barcoding.si.edu/
5CBOL Structure
Member Organizations
Executive Committee
Secretariat Office
Working Groups
Scientific Advisory Board
6Uses of DNA Barcodes
- Establish reference library of barcodes from identified voucher specimens
- If necessary, revise species limits
- Then
- Identify unknowns by searching against reference sequences
- Look for matches (mismatches) against library on a chip
- Before long: Analyze relative abundance in multi-species samples
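The "identify unknowns by searching against reference sequences" step can be sketched as a nearest-match lookup. This is a minimal illustration only: the sequences and species names are invented, the sequences are assumed to be pre-aligned and of equal length, and the uncorrected p-distance stands in for whatever search method (e.g. BLAST) is actually used.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only sites where both sequences have an unambiguous base
    sites = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def best_match(query, library):
    """Return (species, distance) for the closest reference barcode."""
    scored = ((name, p_distance(query, seq)) for name, seq in library.items())
    return min(scored, key=lambda pair: pair[1])


# Hypothetical reference library and query (20 sites, for illustration only)
reference_library = {
    "Species A": "ACGTACGTACGTACGTACGT",
    "Species B": "ACGTTCGTACGAACGTACCT",
    "Species C": "TCGTACGAACGTTCGTACGA",
}
query = "ACGTACGTACGTACGTACCT"

species, dist = best_match(query, reference_library)
print(species, dist)
```

A real reference library would also need a rule for rejecting matches that are too distant, which this sketch omits.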
7Reference versus micro-Barcodes
- BARCODE reference records
- Adhere to data standards
- Bidirectional reads, 500 bp long
- Linked to voucher, species name
- Query barcode records
- Used in BLAST or other searches
- Often single pass reads
- Often very short 100 bp for good IDs
- Can cost less than 2, take less than 6 hours
8DNA barcode default
The Consortium for the Barcode of Life (CBOL) has
so far accepted the 648 base-pair Folmer region
of COI (mitochondrial encoded cytochrome oxidase
1) as the default DNA barcode region for
vertebrates and insects and promotes its use in
as many other clades as possible. The
International Nucleotide Sequence Database
Collaboration (INSDC, consisting of GenBank, the
European Molecular Biology Laboratory and the DNA
Data Bank of Japan) has adopted the data
standards proposed by CBOL for BARCODE data
records, and has empowered CBOL to decide which
gene regions can be given BARCODE status.
9CBOL ↔ GenBank
GenBank
New CO1 barcode
Data standards
CBoL
10How many DNA barcodes do we need, or, what's
ahead?
- 1.7 x 10^6 described species
- 10 barcodes per species
- 20 x 10^6 barcodes of 650 bp each
- 10 x 10^6 more eukaryote species to go
- 100 x 10^6 more barcodes of 650 bp each
- In total this would be 65,000,000,000 bp
- This is twice the total amount of bp currently in GenBank!
- To be completed within the decade
- (Hajibabaei et al., 2005)
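The arithmetic on this slide can be checked in a few lines. The variable names are ours. Note that 1.7 million species at 10 barcodes each gives 17 million barcodes, which the slide appears to round to 20 million, and that the 65 billion bp total corresponds to the 100 million additional barcodes at 650 bp each.

```python
BARCODE_LENGTH_BP = 650
BARCODES_PER_SPECIES = 10

described_species = 1_700_000
undescribed_species = 10_000_000

# Barcodes needed for species already described (slide rounds this to 20 million)
barcodes_described = described_species * BARCODES_PER_SPECIES

# Barcodes needed for the eukaryote species still to be described
barcodes_undescribed = undescribed_species * BARCODES_PER_SPECIES

# Sequence volume for the additional barcodes
bp_undescribed = barcodes_undescribed * BARCODE_LENGTH_BP

print(f"{barcodes_described:,} barcodes for described species")
print(f"{barcodes_undescribed:,} barcodes for species still to go")
print(f"{bp_undescribed:,} bp for those additional barcodes")
```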
13Optimal DNA barcodes
- Barcoding gap: high inter-specific, low intra-specific sequence divergence
- Universal amplification/sequencing with standard primers
- Technically simple to sequence
- Short enough to sequence in one reaction
- Easily alignable (few insertions/deletions)
- Readily recoverable from museum or herbarium samples and other degraded samples
14CO1 divergence in eukaryotes
15CBOL ↔ GenBank
GenBank
Non-CO1 barcode
CBoL
16non-COI barcode regions
- COI alone will not do
- mtDNA evolution too variable across major clades
- NUMTs
- Other faults (e.g. heteroplasmy, introgression, COI not present; e.g. Rubinoff et al. 2007)
- rDNA ITS, D3/D4, cpDNA rpoC1, rpoB, matK
- Multiple barcodes
17CBoL's non-CO1 protocol
- Protocol, to be used as guideline, available now
- Reject CO1 as suitable region for clade of interest
- Propose alternative region based on required evidence as documented
- Barcode gap?
- NJ tree
- Multiple regions?
18The DNA barcode gap
From Meyer et al., PLoS Biology 2004
19DNA barcode gap
From Van Velzen et al., NEV 2007
20DNA barcode gap
- Discontinuity: minimum inter- and maximum intra-species divergence
- However, in paraphyletically clustered barcodes intra > inter divergence!
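The barcode gap as defined here (minimum inter-specific minus maximum intra-specific divergence) can be computed from a table of pairwise distances. The following is a small sketch with invented specimen labels and distances; a negative gap would indicate the overlap mentioned in the second bullet.

```python
from itertools import combinations


def barcode_gap(species_of, dist):
    """Return (max_intra, min_inter, gap) for labelled specimens.

    species_of: dict mapping specimen id -> species name
    dist:       function (id_a, id_b) -> pairwise divergence
    """
    intra, inter = [], []
    for a, b in combinations(sorted(species_of), 2):
        bucket = intra if species_of[a] == species_of[b] else inter
        bucket.append(dist(a, b))
    # max()/min() raise ValueError if a category has no pairs at all
    max_intra, min_inter = max(intra), min(inter)
    return max_intra, min_inter, min_inter - max_intra


# Hypothetical specimens and divergences, for illustration only
species_of = {"a1": "sp. A", "a2": "sp. A", "b1": "sp. B", "b2": "sp. B"}
pairwise = {
    ("a1", "a2"): 0.004, ("b1", "b2"): 0.010,
    ("a1", "b1"): 0.080, ("a1", "b2"): 0.085,
    ("a2", "b1"): 0.078, ("a2", "b2"): 0.082,
}

max_intra, min_inter, gap = barcode_gap(species_of, lambda a, b: pairwise[(a, b)])
print(max_intra, min_inter, gap)
```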
21Agave (Agavaceae) rpoB Cowan et al.
22Rejection of CO1
- Reject CO1 as suitable region for clade of interest
- Propose alternative region based on required evidence, i.e.
- Pattern of intra- and interspecific variation
- Resolving power
- Universality
- Document the number of primer pairs needed to successfully PCR amplify and identify species throughout the clade of interest
23Implementation
- Protocols will be adopted for a period of 6 months during which CBOL is open to suggestions for their improvement from the community.
- CBOL will normally expect publication of evidence for effectiveness of proposed non-COI barcode region(s) in a peer-reviewed publication prior to submission of a proposal
- Prior peer review and publication will support the proposal's claims and will inform the community of the proposed barcode region(s)
- Upon approval by CBOL's Executive Committee, INSDC will be informed immediately and BARCODE status can be given
24Challenges
- Is effectiveness of DNA barcode jeopardized by using parameter-poor models?
- Is NJ too crude to provide correct matches between closely related barcodes?
- How will non-coding DNA sequences perform when matching unknowns?
- Do we need Bayesian matching for critical species? (PPs on match, Priors to express uncertainty on population parameters)
- Is matching of multiple barcodes a special case?
25DNA barcode matching
- Character-based for closely related barcodes?
- Phylogenetic clustering
- Distance-based matching: what models?
- Low divergence → few parameters (JC, K2P)
- Codon models?
- Composite barcodes → composite models?
- Non-coding regions: length-variation
- Pragmatism: large reference libraries, speed
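The two parameter-poor distance models named above can be written out directly. The sketch below uses the standard Jukes-Cantor and Kimura two-parameter formulas on pre-aligned sequences; the function names and example sequences are ours and purely illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def site_proportions(a, b):
    """Return (P, Q): proportions of transitions and transversions."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    n = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        n += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return transitions / n, transversions / n


def jc_distance(a, b):
    """Jukes-Cantor: d = -3/4 ln(1 - 4p/3), one substitution rate."""
    P, Q = site_proportions(a, b)
    return -0.75 * math.log(1 - 4 * (P + Q) / 3)


def k2p_distance(a, b):
    """Kimura 2-parameter: d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)."""
    P, Q = site_proportions(a, b)
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# Hypothetical pair: one transition (A/G) and one transversion (C/A) in 20 sites
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "GAGTACGTACGTACGTACGT"
print(jc_distance(seq1, seq2), k2p_distance(seq1, seq2))
```

At low divergence both corrections stay close to the raw proportion of differing sites, which is one reason few-parameter models are attractive for closely related barcodes.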
26DNA barcode models
- Simulate DNA barcodes using parameter-rich models, derived from insect COI and from cpDNA atpB data (GTR, c113)
- 100 replicates of simulated data sets: 60 barcode sequences of 654nt
- Distance models: simple → complex
- NJ clustering of resulting distances
- Semistrict consensus of 100 NJ trees
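For readers unfamiliar with the NJ clustering step, here is a compact sketch of neighbour joining on a distance table. It returns only the topology as a Newick string (branch lengths and the consensus step are left out), the five-taxon distance matrix is an invented toy example, and it is not the PAUP implementation used in the talk.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a Newick topology (no branch lengths).

    names: list of taxon labels
    dist:  dict mapping frozenset({x, y}) -> distance between x and y
    """
    nodes = list(names)
    d = dict(dist)  # copy; distances to new internal nodes are added below

    def get(x, y):
        return d[frozenset((x, y))]

    while len(nodes) > 2:
        n = len(nodes)
        row_sum = {x: sum(get(x, y) for y in nodes if y != x) for x in nodes}
        # Choose the pair minimising Q(x, y) = (n - 2) d(x, y) - r(x) - r(y)
        best_pair, best_q = None, None
        for x, y in combinations(nodes, 2):
            q = (n - 2) * get(x, y) - row_sum[x] - row_sum[y]
            if best_q is None or q < best_q:
                best_pair, best_q = (x, y), q
        x, y = best_pair
        new = f"({x},{y})"  # the Newick string doubles as the node label
        # Distance from the new internal node to each remaining node
        for z in nodes:
            if z not in (x, y):
                d[frozenset((new, z))] = 0.5 * (get(x, z) + get(y, z) - get(x, y))
        nodes = [z for z in nodes if z not in (x, y)] + [new]
    return f"({nodes[0]},{nodes[1]});"


# Toy distance matrix for five taxa (illustrative values only)
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {frozenset((names[i], names[j])): matrix[i][j]
        for i in range(5) for j in range(i + 1, 5)}

tree = neighbor_joining(names, dist)
print(tree)
```

Ties in Q are broken by input order here, so equivalent unrooted trees may be written differently depending on how the taxa are listed.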
27DNA barcode models
NJ (poor model)
100 NJ trees
Semistrict consensus
NJatpB r/p
NJCOI r/p
28DNA barcode models
- Findmodel (Los Alamos National Lab.): best-fitting model for Lepidopteran COI data set
- MrBayes/Tracer: model parameter values
- Simulation tree: angiosperm species-level phylogenetic tree topology (not ultrametric)
- Seq-Gen: simulate 100 reps., 654nt x 60 seqs.
- PAUP: NJ and consensus analysis
- TreeView: tree interpretation
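To give a feel for the simulation step, the sketch below generates 60 sequences of 654 nt from a random ancestor. It is a deliberate simplification and not what Seq-Gen did in the talk: it substitutes the one-parameter Jukes-Cantor model for GTR, uses a star tree with an arbitrary branch length of 0.05 instead of the angiosperm topology, and the function name is ours.

```python
import math
import random

BASES = "ACGT"


def evolve_jc69(seq, branch_length, rng):
    """Evolve seq along one branch under Jukes-Cantor.

    branch_length is in expected substitutions per site; the chance that
    a site ends in a different base is 3/4 * (1 - exp(-4t/3)).
    """
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # Each of the three other bases is equally likely under JC
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(1)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(BASES) for _ in range(654))

# One simulated data set: 60 barcodes of 654 nt on a star tree (assumption)
replicate = [evolve_jc69(ancestor, 0.05, rng) for _ in range(60)]
print(len(replicate), len(replicate[0]))
```

Repeating the last step 100 times would give the number of replicates described on the slide.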
29cpDNA atpB model
Relative subst. rates
Base composition
30mtDNA COI model
Relative subst. rates
Base composition
31atpB vs. COI models
Relative subst. rates
Base composition
atpB
COI
32atpB
Model tree atpB/GTR
33COI
Model tree atpB/GTR
34Over-parametrization?
- Parameter-rich models not efficient in reconstructing parameter-rich patterns?
- Parameter-poor models do better
- Artefact of pairwise comparison?
- Various shapes branch lengths
- Different base-composition across tree
- Different omega rates across tree
- Codon models?
35Conclusions
- Non-COI barcode regions will be needed and are proposed through CBOL protocol
- CBOL approval → adoption by INSDC
- NJ/K2P sufficient for performance testing of proposed region
- Character-based DNA barcode matching needed for closely related barcodes
- Multiple barcodes matched simultaneously