Title: BS961
1BS961
2Objectives
- Describe how the genome sequence of specific microorganisms can be exploited in clinical practice.
- Describe the basic principles behind manual and automated DNA sequencing, and pyrosequencing.
- Explain how nucleotide sequence databases are accessed and describe the different types of databases.
- Discuss how genes are identified in nucleotide sequences.
- Reading: Brown, Genomes 3, Chapter 4.
3DNA-based methods for virus pathogens
4Microarrays
- Miller and Tang (2009) Clinical Microbiology
Reviews 22, 611-633.
5Microarrays
6Microarrays
7Microarrays- examples
- Respiratory pathogens
- Several systems available
- e.g. ResPlex II assay (Qiagen): Flu-A, Flu-B, PIV-1, PIV-2, PIV-3, PIV-4, RSV-A, RSV-B, hMPV, RhV, EnV, and severe acute respiratory syndrome CoV
- Multiplex RT-PCR
8Microarrays- examples
- For each pathogen, target-specific capture probes are covalently linked to a specific set of color-coded beads.
- Labeled PCR products are captured by the bead-bound capture probes in a hybridization suspension.
9Microarrays- examples
- A microfluidics system delivers the suspension hybridization reaction mixture to a dual laser detection device.
- A red laser identifies each bead (or pathogen) by its color-coding.
- A green laser detects the hybridization signal associated with each bead (indicating the presence or absence of a particular pathogen).
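The two-laser readout described above amounts to a lookup (which bead is this?) followed by a threshold (is there a signal on it?). A minimal Python sketch, assuming hypothetical bead colour codes, signal units and positivity cutoff; none of these values come from the ResPlex II assay itself:

```python
from statistics import median

# Hypothetical bead colour codes; a real assay defines its own mapping.
BEAD_TO_PATHOGEN = {1: "Flu-A", 2: "Flu-B", 3: "RSV-A"}


def call_pathogens(events, bead_map=BEAD_TO_PATHOGEN, cutoff=200.0):
    """Return {pathogen: True/False} from (bead_code, green_signal) events.

    The cutoff is an assumed value used only for illustration.
    """
    signals = {}
    for bead_code, green in events:
        pathogen = bead_map[bead_code]                   # red laser: bead identity
        signals.setdefault(pathogen, []).append(green)   # green laser: hybridization signal
    # A pathogen is called present if the median signal on its beads reaches the cutoff.
    return {p: median(v) >= cutoff for p, v in signals.items()}


events = [(1, 900.0), (1, 1100.0), (3, 30.0), (1, 1000.0), (3, 50.0)]
print(call_pathogens(events))
```

Pathogens whose beads were never read (Flu-B here) are simply absent from the result rather than reported as negative.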
10Microarrays examples
11Real time PCR
12Real time PCR example
- Nix et al (2008) Journal of Clinical Microbiology, 46, 2519-2524.
- Parechoviruses
- Uses primers to regions present in all parechoviruses
13Parechovirus
14Multiplex PCR
- Can multiplex using probes of different colour
15Sequencing strategies
- Sequencing is usually achieved by the dideoxynucleotide method.
- This requires:
- Template DNA to be sequenced, together with a primer and DNA polymerase.
- Modified nucleotides (dideoxynucleotides) lacking the 3'-OH needed for chain extension in DNA synthesis. These are mixed with ordinary nucleotides, so at each position some chains are terminated and some are not, and a range of fragments is generated, each ending with a specific dideoxynucleotide.
- A gel system capable of separating DNA on the basis of size with a resolution of one nucleotide.
- A detection method, usually dye-labelled dideoxynucleotides (each of A, G, C and T labelled with a dye of a different colour) detectable by laser.
16Dideoxynucleotide sequencing
- AAGCTAGCTGGCAAATGGCGTCTCAC
- TTCGATCG (primer)
- TTCGATCGA
- TTCGATCGAC
- TTCGATCGACC
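The ladder of products on this slide can be reproduced with a short simulation. This is a minimal Python sketch, assuming (as the slide does) that the primer is written aligned to the left end of the template and that every possible termination product is present; the function names are illustrative only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def chain_terminated_fragments(template, primer):
    """Return every chain-terminated product, shortest first.

    Each product is the primer extended by one more complementary base,
    which models termination by a dideoxynucleotide at that position.
    """
    products = []
    strand = primer
    for base in template[len(primer):]:
        strand += COMPLEMENT[base]
        products.append(strand)
    return products


def read_gel(products):
    """Read the sequence from the last base of each size-ordered fragment."""
    return "".join(p[-1] for p in sorted(products, key=len))


template = "AAGCTAGCTGGCAAATGGCGTCTCAC"
primer = "TTCGATCG"
ladder = chain_terminated_fragments(template, primer)
print(ladder[:3])        # the three products shown on the slide
print(read_gel(ladder))  # sequence read from the bottom of the gel upwards
```

Sorting by length plays the role of the gel, and the terminal base of each fragment plays the role of the dye colour.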
17Detection of bands
18Output
19Sequence assembly
- In all sequencing projects the amount of sequence that can be obtained from one reaction is much less than that needed to complete the project, so contiguous sequences (contigs) must be assembled from several overlapping sequences.
23Strategies for small genomes
- For small genomes, e.g. bacteria, an almost completely shotgun approach is often the most efficient, with completion of gaps by more directed methods.
- e.g. Haemophilus influenzae
24Sequencing of Haemophilus influenzae
25Assembling contigs
- Sequence 1: enter into database.
GATTCGTAGGCTTTAAGCTTCCGTCGACGCTGCGTAGC
- Sequence 2: compare with database. Does it overlap with sequence 1?
CGATCGTGCCCCGTACTGACTGCATGCTGACACAGTC
GATTCGTAGGCTTTAAGCTTCCGTCGACGCTGCGTAGC
- No. Enter sequence 2 into database.
- Sequence 3: compare with database. Does it overlap with sequence 1 or sequence 2? No. Enter into database.
40Assembling contigs
- Sequence 4: compare with database. Does it overlap with sequences 1, 2 or 3? No. Enter into database.
- Sequence 5: compare with database. Does it overlap with sequences 1, 2, 3 or 4? Yes. It overlaps sequence 2.
- Make a contig and enter it into the database.
Sequence 2: ACCGTCGCCCTGCCCGTAGCTG
Sequence 5: CCCGTAGCTGCCATTTTCGA
Contig: ACCGTCGCCCTGCCCGTAGCTGCCATTTTCGA
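The compare-then-enter loop walked through on these slides can be sketched in a few lines of Python. This is an illustration only, assuming exact suffix-to-prefix matches and a hypothetical minimum overlap of 5 bases; real assemblers also handle sequencing errors, reverse-complement reads and repeats.

```python
def overlap_length(left, right, min_overlap=5):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def add_read(database, read, min_overlap=5):
    """Merge `read` into the first entry it overlaps; otherwise store it as new."""
    for i, entry in enumerate(database):
        n = overlap_length(entry, read, min_overlap)
        if n:                                   # read extends the entry to the right
            database[i] = entry + read[n:]
            return
        n = overlap_length(read, entry, min_overlap)
        if n:                                   # read extends the entry to the left
            database[i] = read + entry[n:]
            return
    database.append(read)                       # no overlap: enter into database


database = []
add_read(database, "ACCGTCGCCCTGCCCGTAGCTG")    # sequence 2 from the slide
add_read(database, "CCCGTAGCTGCCATTTTCGA")      # sequence 5 from the slide
print(database)
```

With the two sequences from the slide, the 10-base overlap CCCGTAGCTG is found and the database ends up holding the single contig shown above.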
54Joining contigs
- Two large contigs
- A sequence overlapping both is found
- The contigs are joined
55Filling gaps
- As the sequence accumulates, there are diminishing returns. New sequences become rarer and some areas are sequenced many times.
- So there are gaps which need to be filled.
56Gaps
- Sequence gaps: where, by random chance, no sequence has been obtained.
- Physical gaps: where the region has not been cloned at all, so no sequence can be obtained.
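The diminishing returns of random sequencing can be illustrated with the standard Poisson (Lander-Waterman) approximation, which is not part of the original slides: if reads land at random, the expected fraction of the genome still unsequenced at coverage c is about e^-c. A minimal Python sketch, assuming a hypothetical read length of 500 bases and a genome of about 1.83 Mb (roughly the size of Haemophilus influenzae):

```python
import math


def fraction_uncovered(n_reads, read_len, genome_len):
    """Expected fraction of the genome with no read, under random sampling."""
    coverage = n_reads * read_len / genome_len
    return math.exp(-coverage)


GENOME = 1_830_000   # approximate genome size in bases
READ_LEN = 500       # assumed read length, for illustration only

for n_reads in (5_000, 10_000, 20_000, 40_000):
    missing = fraction_uncovered(n_reads, READ_LEN, GENOME)
    print(f"{n_reads:>6} reads: about {missing * GENOME:,.0f} bases still unsequenced")
```

Each doubling of the number of reads removes a smaller absolute amount of unsequenced DNA, which is why the last gaps are closed by directed methods rather than by more random reads. The model says nothing about physical gaps, where the region was never cloned.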
58Success of this approach
- Very many microbial genomes have been sequenced
in this way
59Sequencing large genomes
- It is argued that a clone-contig approach is best, particularly because organisms with larger genomes often contain a lot of repetitive sequences, and it is difficult to join these correctly if only short sequences are analysed.
60Clone contig approach
- Relies on cloning large fragments of DNA, e.g. 300 kb. These are then mapped onto the chromosome using physical maps.
61Human genome
- The draft human genome was published by two groups at the same time in 2001:
- The International Human Genome Sequencing Consortium, a group of scientists funded by non-profit-making bodies. Used the clone-contig procedure (Nature 409, 860-921).
- A private company, Celera. Used the shotgun approach, which was much faster, but did use scaffolding data already put into the public domain by the first group (Science 291, 1304-1349).
62Pyrosequencing
- Recently a different method, pyrosequencing, more suited to ultra-high throughput, has been developed.
- http://www.pyrosequencing.com/DynPage.aspx?id=7454
64Step 1
- The reaction contains a primer, template and DNA polymerase, but also a number of other components: the enzymes ATP sulfurylase, luciferase and apyrase, and the substrates adenosine 5' phosphosulfate (APS) and luciferin.
65Step 2
- The first of four dNTPs is added to the reaction.
- DNA polymerase catalyzes the incorporation of the deoxynucleotide triphosphate into the DNA strand, if it is complementary to the base in the template strand.
- Each incorporation event is accompanied by release of pyrophosphate (PPi) in a quantity equimolar to the amount of incorporated nucleotide.
66Step 3
- ATP sulfurylase quantitatively converts PPi to ATP in the presence of adenosine 5' phosphosulfate.
- This ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin, which generates visible light in amounts proportional to the amount of ATP.
- The light produced in the luciferase-catalyzed reaction is detected by a charge-coupled device (CCD) camera and seen as a peak in a pyrogram.
- Each light signal is proportional to the number of nucleotides incorporated, which gives a different sort of output from dideoxynucleotide sequencing.
68Step 4
- Apyrase, a nucleotide-degrading enzyme, continuously degrades unincorporated dNTPs and excess ATP.
- When degradation is complete, another dNTP is added.
70Step 5
- Addition of dNTPs is performed one at a time. As
the process continues, the complementary DNA
strand is built up and the nucleotide sequence is
determined from the signal peak in the pyrogram.
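Steps 1 to 5 can be mimicked with a small simulation in which the peak height for each dispensed dNTP equals the number of bases incorporated. This is a minimal Python sketch, assuming an idealised, noise-free signal and a hypothetical fixed dispensation order A, C, G, T; the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pyrogram(template, dispensation="ACGT", max_cycles=50):
    """Return a list of (nucleotide, peak_height) for each dNTP addition."""
    # Strand to be synthesised, written in the order the polymerase adds bases.
    target = "".join(COMPLEMENT[b] for b in template)
    pos = 0
    peaks = []
    for _ in range(max_cycles):
        for nt in dispensation:
            incorporated = 0
            # A run of identical bases is incorporated in a single addition,
            # releasing proportionally more PPi and therefore more light.
            while pos < len(target) and target[pos] == nt:
                incorporated += 1
                pos += 1
            peaks.append((nt, incorporated))
            if pos == len(target):
                return peaks
    return peaks


def call_sequence(peaks):
    """Rebuild the synthesised strand from the peak heights."""
    return "".join(nt * height for nt, height in peaks)


peaks = pyrogram("TGGCAAAT")
print(peaks)
print(call_sequence(peaks))
```

A peak of height 0 means the dispensed nucleotide was not complementary and was removed by apyrase before the next addition. Runs of identical bases give one taller peak rather than several separate ones, which is the difference in output from dideoxynucleotide sequencing noted in Step 3.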
72
- The method can be automated considerably. Random shearing of genomic DNA, PCR amplification and complex sample handling methods mean that around 400,000 fragments can be sequenced at the same time, each sequence being 200-300 nucleotides.
- These can be automatically assembled into contigs, the only problems being repetitive sequences, because of the small size of the sequences generated. This is called 454 sequencing.
73Pathogen detection
- e.g. Briese et al (2009) PLOS Pathogens 5, e1000455
- Lujo virus, Arenaviridae.
- Case of haemorrhagic disease
- RT-PCR: random amplification, ligation of specific linkers, 454 sequencing
74Pathogen detection
- Worked with 3 libraries from different tissues
- 87,500-106,500 reads from each
- Found 7 sequence fragments matching arenavirus
- Completed gaps using conventional PCR
78New respiratory viruses
79Sequence databases
- There are a number of databases of different types.
- Nucleotide: GenBank and EMBL are the main ones for well characterised sequences. htgs contains unfinished High Throughput Genomic Sequences (i.e. from genome projects) until they have been characterised further.
80More databases
- Protein: PIR and Swiss-Prot are the main ones.
- Global: nr (non-redundant). This is a compilation of several databases.
- ESTs: dbEST
81ESTs (expressed sequence tags)
- Short sequences obtained from total mRNA isolated from a tissue.
- Derived by random cDNA cloning and sequencing without further purification.
- Useful to show which genes are expressed in a tissue, as only these are represented in the RNA.
- A collection of ESTs from different tissues gives an idea of the total number of genes in an organism.
82Accessing databases
- http://www.ncbi.nlm.nih.gov/
- Simple search terms
84Identifying genes in nucleotide sequences
- http://www.ncbi.nlm.nih.gov/books/bv.fcgi?indexed=google&rid=sef.section.168
- Initially the sequence generated from a genome project is largely featureless and needs to be interpreted. The most important things to find are the locations of the genes.
85
1 tttgaaaggg gtctcctaga gagcttggcc gtcgggcctt acaccccgac ttgctgagtt
61 tctctaggag agtccctttc ccagccagag gtggctggtc aaacaatacc aaacgtaact
121 aaacatctaa gataacatag ccctatgcct ggtctccacc agttgaaggc atcttgcaat
181 aaaatgggtg gattaagacg cttaaagcat ggagtcaatt atcttttcta actagtgatc
241 ttcactgggt ggcagatggc gtgccataac tctattagtg ggataccacg ctcgtggatc
301 ttatgcccac acagccatcc tctagtaagt ttgcaaggtg tctgatgagg cgtgggaact
361 tattggaaat aattacttgc tgcgaagcat cctactgcca gcggatcaac acctggtaac
421 aggtgcccct ggggccaaaa gccacggttt aacagaccct ttaggattgg ttaaaacctg
481 agtaattatg gaagatactt agtacctacc aacttggtaa cagtgcaaac actagttgta
541 aggcccacga aggatgccca gaaggtaccc gcaggtaaca agagacactg tggatctgat
601 ctggggccac ctacctctat cctggtgagg tggttaaaaa acgtctagtg ggccaaaccc
661 aggggggatc cctggtttcc ttattttagt gtaaatgtca ttatggagac aatcaagagc
721 attgcagata tggcgaccgg tgtaactaaa accattgatg ccacaatcaa ttctgttaat
781 gagatcatca ctaacacaga taatgcttca ggtggagata tattgactaa agttgctgat
841 gatgcttcaa atattttagg gcccaactgt tatgcgacaa catctgagcc agaaaacaag
901 gatgtggtgc aagcaaccac cactgtgaac accactaatc tgacacagca cccatcagca
961 ccaacgttac catttacacc agacttttcg aatgttgaca cgtttcattc aatggcttat
1021 gatactacaa ctggtagtaa gaaccctaat aagttagtta ggttaacgac acatgcttgg
1081 gctagtaccc tacagagggg tcatcagatt gatcatgtta atctaccagt tgacttctgg
1141 gatgaacaga ggaaaccagc ttatggccat gctaaatatt ttgcagctgt tcggtgtgga
86
1201 tttcattttc aagtacaggt caatgtgaat cagggaactg ctgggagtgc tttggtagtg
1261 tatgaaccaa agccagtagt tgattatgat aaggatttgg aatttggagc atttaccaat
1321 ttaccacatg tgttaatgaa cttggccgag actacccagg ccgacttatg tatcccctat
1381 gttgcagata caaactatgt gaagactgat tcatctgact tagggcaatt gaaagtttat
1441 gtgtggactc cccttagcat tccatcaggc tcatctaacc aagtggacgt gactatattg
1501 ggtagcttat tacaattgga tttccaaaac ccaagggtgt atgggcaaaa tgttgacatt
1561 tacgatacag caccctctaa accaattcca ttgaggaaga ctaaatattt gactatgagc
1621 acaaaataca aatggacaag aaataaagta gacatagctg aaggtccagg ttcaatgaac
1681 atggcaaatg tacttagtac gacagcagca caatcagtag cattggttgg ggagagggct
1741 ttttatgatc ccaggactgc tggtagcaaa tctagatttg atgacttagt aaaaatctca
1801 cagttgtttt cagttatggc agattccacc actccatctg ccaatcatgg aatagaccaa
1861 aagggttatt tcaaatggtc tgccaattct gatccacagg caatagtgca tagaaactta
1921 gttcatttaa atctatttcc aaatttgaag gtctttgaaa acagttattc atacttcaga
1981 ggttctctta taatcaggtt aagtgtttat gctagtacat tcaacagagg ccgtttgaat
2041 gggttctttc caaattccag tacagatgaa acttctgaaa ttgataatgc catctacacc
2101 atatgtgata ttggatctga caatagtttt gagattacta tcccttattc attttccact
2161 tggatgagga agacacatgg taaacctatt ggcctattcc agattgaagt cctaaatagg
2221 ttaacataca attactccag tccaaatgag gtatactgca tagtgcaagg taaaatggga
2281 caagacgcca aatttttctg ccccactggg tctttagtaa ctttccagaa ttcatggggt
2341 tcccaaatgg acttgactga cccgctttgc atagaagatt cagtagaaga ttgtaagcaa
87Prokaryotes and archaea
- Genes are usually easily seen, as they contain no introns and the genome is very gene-rich with few spaces between genes.
- A simple search for open reading frames (ORFs) can often identify the genes. So, translation of a DNA sequence in all six reading frames is performed using, for example, the Translate tool on the ExPASy server (http://www.expasy.org/tools/dna.html).
88Why 6 reading frames?
89Why 6 reading frames?
- Ribosomes read an RNA sequence in triplets
- GTC GCG ACT AGA ACT CGT GCT AAA
- Val Ala Thr Arg Thr Arg etc
- G TCG CGA CTA GAA CTC GTG CTA AA
- Ser Arg Leu Glu Leu Val etc
- GT CGC GAC TAG AAC TCG TGC TAA A
- Arg Asp Stop Asn Ser Cys etc
90Why 6 reading frames?
- So 3 reading frames, but DNA is double stranded.
- Only one strand is usually shown to save space, but the other strand could be the one actually used.
- This makes a second set of 3 frames, so 6 in all.
91
- 5' GTCGCGACTAGAACTCGTGCTAAA 3'
- 3' CAGCGCTGATCTTGAGCACGATTT 5'
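The six translations of this example sequence can be generated programmatically. This is a minimal Python sketch using the standard genetic code, with `*` marking a stop codon; the frame labels (+1 to +3 for the strand shown, -1 to -3 for the complementary strand read 5' to 3') are a naming choice made for this example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the complementary strand written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return the translation of all six reading frames, keyed by frame label."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for label, protein in six_frames("GTCGCGACTAGAACTCGTGCTAAA").items():
    print(label, protein)
```

Frames +1, +2 and +3 reproduce the three translations on the earlier slide in one-letter code (Val Ala Thr Arg..., Ser Arg Leu Glu..., Arg Asp Stop Asn...).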
92e.g. A section of the E. coli genome
93
- Most genes have ORFs of at least 100 bp, and often the longest ORF in a region is the gene. This is not always the case, so other criteria are also used to analyse the predicted genes.
- The ORF may encode a protein similar to previously described ones.
- The ORF may have a GC content, codon frequency, or oligonucleotide composition typical of known protein-coding genes from the same organism.
- The ORF may be preceded by a typical ribosome-binding site.
- The ORF may be preceded by a typical promoter (a region that controls gene expression).
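The first and third criteria above (ORF length and GC content) are easy to compute. This is a minimal Python sketch, assuming an ORF is defined as ATG followed in frame by the first stop codon and using the 100 bp minimum from the slide as the default; similarity searches, ribosome-binding sites and promoters are not modelled.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the complementary strand written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def find_orfs(seq, min_len=100):
    """Return (strand, start, end, orf) tuples for ORFs of at least min_len bases.

    start and end are 0-based positions on the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon in this frame
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:   # keep only sufficiently long ORFs
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


toy = "CCATGGCTGCTGCTTAAGG"
for strand, start, end, orf in find_orfs(toy, min_len=9):
    print(strand, start, end, orf, round(gc_content(orf), 2))
```

The toy sequence is far shorter than a real gene, so the example lowers `min_len`; with the default of 100 it reports nothing.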
94Some unicellular eukaryotes
- The small number of introns and high gene density make gene prediction less difficult than in higher eukaryotes. Genes can be confirmed using similar methods to prokaryotes.
- Some, however, do have genes with several introns and short ORFs.
- Here ESTs can be very useful in identifying genes.
- By definition an EST comes from an expressed region of DNA, hence a gene.
95Most multicellular eukaryotes
- Gene organization is so complex that gene identification is a major problem.
- Here there are often large intergenic regions, and the genes themselves contain numerous introns, many of them long.
- An added complication is that many proteins exist in different forms due to alternative splicing, and it is important to identify these variants as they could be related to disease or to functions in different tissue types.
96Most multicellular eukaryotes
- Again ESTs are important in defining genes.
- Exon-intron boundaries can be predicted: introns often begin with GT at the 5' end and end with AG at the 3' end.
- Similar sequences in other organisms are very useful.
- Statistical analysis of GC content (which differs in coding regions) and CpG islands (located close to genes).
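The statistical measures in the last bullet can be computed directly from a sequence. This is a minimal Python sketch; the thresholds in `looks_like_cpg_island` (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6) are the commonly quoted Gardiner-Garden and Frommer criteria, which are an addition here and not taken from the slides.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three assumed thresholds to a candidate region."""
    seq = seq.upper()
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) > min_ratio)


print(looks_like_cpg_island("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```

In practice these functions would be applied to a sliding window along the genome rather than to a whole sequence at once.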
97Organization of the human iduronate 2-sulfatase gene
- This gene is located at positions 152960-177995 of the human X chromosome.
- Encodes a 550-aa protein.
- Mutations in this gene cause mucopolysaccharidosis type II, also known as Hunter's disease.
- Tissue deposits of chondroitin sulfate and heparan sulfate.
- Symptoms of Hunter's disease include coarse facial features, hepatosplenomegaly, cardiovascular disorders, deafness, and, in some cases, progressive mental retardation.
98
- The top line indicates the X chromosome and shows the location of the iduronate sulfatase gene (thick line in the middle).
- Thin lines at the bottom indicate two alternative transcripts.
- Exons are shown as small rectangles.