Bioinformatics: Data-mining Bacterial Genome Sequences

Slides: 59
Provided by: markp95
1
Bioinformatics: Data-mining Bacterial Genome Sequences
  • Professor Mark Pallen
  • University of Birmingham

2
Bioinformatics: Definitions
  • Fusion of Biology, Computer science, Mathematics
  • Broad Meaning
  • Any computationally intensive research with
    biological relevance
  • Narrow meaning
  • Computer-based analysis and archiving of
    macromolecular sequence data

3
The Scope of Bioinformatics
  • Algorithm Development
  • Large-scale (semi-)Automated Analysis
  • Domain Hunting
  • Detailed Analysis by Expert
  • Interface Design
  • Graphics
  • Web
4
Scale
  • Large-scale
  • (semi)-automated analysis of genome sequence
  • Interface with functional genomics
  • Medium-scale
  • Domain hunting
  • Small-scale
  • Analysis of individual sequence by power-user

5
(No Transcript)
6
Bioinformatics: Enabling Technologies
  • Maths
  • Algorithm development
  • Computer science
  • Ever-increasing list of free software
  • Bioinformatics programs
  • Operating system: Linux
  • Scripting languages glue programs together
  • Perl
  • Growth of the Internet
  • Distance is dead! Distributed resources
  • User-friendliness of the Web
  • Just cut and paste!

7
Bioinformatics: Challenges
  • DNA and Protein Sequences
  • Exponential increase
  • Genome Sequencing
  • Need for annotation
  • Molecular stamp collecting?
  • Role in drug discovery
  • Big business

8
Bioinformatics: Challenges
  • Interplay between the wet and the dry
  • Bioinformatics Predictions
  • range from the very general to the very specific
  • range from highly speculative to the almost
    certain
  • But in the end they are still only predictions
  • Need for experimental confirmation

9
Challenges: The Data Flood
  • 52 bacterial genomes completed and published
  • 100,000 genes
  • 228 genomes ongoing
  • 450,000 more genes when finished

10
The Post-Genomic Iceberg
Discovered Biology
The Undiscovered Genotype
  • Most genes are of unknown function
  • Undiscovered genomic diversity
Undiscovered Biology
The Undiscovered Phenotype
  • Most bacterial physiology inapparent in the lab
  • Undiscovered regulators and regulons
11
Bioinformatics Approaches
  • Multiple levels of analysis
  • Gene Finding
  • Protein function prediction
  • Power of homology
  • Pitfalls of homology
  • Comparative genomics
  • Metabolism reconstruction
  • Interface with functional genomics

12
What is a Sequence?
  • DNA sequence: double stranded, antiparallel
  • Conventionally written 5' to 3'
  • 5'-ATGAGTACCG CTAAATTAGT TAAATCAAAA-3'
  • 3'-TACTCATGGC GATTTAATCA ATTTAGTTTT-5'
  • RNA sequence: single stranded, U instead of T
  • 5'-AUGAGUACCG CUAAAUUAGU UAAAUCAAAA-3'
  • Protein sequence
  • Conventionally written N-terminal to C-terminal
  • 3-letter code: Met Ser Thr Ala Lys Leu
  • 1-letter code: MSTAKLVKSKATN
  • Sequences usually written in a monospaced font
    like Courier
  • Times Courier
  • AGCGGGCGG AGCGGGCGG
  • ATCGTTCTG ATCGTTCTG
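
The sequence conventions on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are illustrative, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (an assumption
# of this sketch; the slides do not name a translation table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def complement(dna):
    """Return the complementary strand, base by base (so it reads 3' to 5')."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))


def reverse_complement(dna):
    """Return the complementary strand written in the usual 5' to 3' direction."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA sequence (U instead of T)."""
    return dna.replace("T", "U")


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


top_strand = "ATGAGTACCGCTAAATTAGTTAAATCAAAA"
print("3'-" + complement(top_strand) + "-5'")
print("5'-" + transcribe(top_strand) + "-3'")
print(translate(top_strand))
```

Running it on the 30-base example from the slide reproduces the bottom strand, the RNA sequence and the first ten residues of the 1-letter protein sequence.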

13
(No Transcript)
14
First get your sequence!
  • Most sequencing is now...
  • Performed on DNA (rather than RNA or protein)
  • Performed using the Sanger dideoxy method
  • Exceptions...
  • rRNA is sometimes sequenced directly
  • N-terminal and mass spectrometry sequencing of
    proteins
  • Template for sequencing can be
  • DNA cloned in a plasmid (e.g. pUC19)
  • DNA cloned in a single-stranded phage (e.g. M13)
  • PCR products

15
First get your sequence!
  • Automated Sequencing
  • Fluorescent dyes used
  • Extract sequence from chromatogram
  • Must extract only the error-free region

16
Where to analyse?
17
Assembling a contig
18
Analysis of nucleotide sequence data
  • Sequence Composition
  • bacteria vary greatly in chromosomal GC
    content; each genus has a characteristic GC content
  • useful in
  • checking you cloned what you think you have
  • identifying foreign DNA in a genome
  • Predicted Restriction maps
  • helpful in the lab
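
As a rough illustration of how GC content can point to foreign DNA, the sketch below computes GC content in sliding windows and flags windows that differ from the overall value. The window size, step and tolerance are arbitrary assumptions chosen for the toy data, and the "genome" is synthetic, not real sequence.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def windowed_gc(seq, window=1000, step=500):
    """Return (start, gc_fraction) for successive overlapping windows."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


def flag_atypical(seq, window=1000, step=500, tolerance=0.12):
    """Return windows whose GC fraction differs from the whole-sequence value
    by more than `tolerance` (a hypothetical cut-off, not an established one)."""
    overall = gc_content(seq)
    return [(start, gc) for start, gc in windowed_gc(seq, window, step)
            if abs(gc - overall) > tolerance]


# Synthetic example: a 50% GC "host" with a 1000-base AT-only insert at 2000.
host = "ACGT" * 1000
island = "AATT" * 250
genome = host[:2000] + island + host[2000:]

for start, gc in flag_atypical(genome):
    print(f"window at {start}: GC = {gc:.2f}")
```

Only the windows overlapping the inserted block are reported. Real analyses would also need to allow for natural variation along a chromosome.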

19
Analysis of nucleotide sequence data
  • Search for Other Sequence Features
  • Promoters
  • Ribosome-binding Sites
  • Repeats
  • Inverted Repeats (e.g. terminators)
  • Consensus Sequences for regulator binding sites
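
One of the features listed above, the inverted repeat, is easy to search for directly. The following is a minimal sketch that looks only for perfect stem-loops; the stem length and loop range are illustrative assumptions, and real terminator prediction also weighs mismatches and base-pairing energy.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_inverted_repeats(seq, stem=6, min_loop=3, max_loop=8):
    """Return (start, loop_length) for every perfect inverted repeat:
    a stem of `stem` bases, a loop, then the stem's reverse complement."""
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        target = reverse_complement(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop          # start of the candidate right arm
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == target:
                hits.append((i, loop))
    return hits


# Toy sequence: stem GCCCGC, loop TTTT, then GCGGGC (reverse complement of the stem).
demo = "AAGCCCGCTTTTGCGGGCAA"
print(find_inverted_repeats(demo))
```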

20
Searching for coding regions
  • Any given DNA sequence can be translated in 6
    different reading frames, 3 on each strand
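
A six-frame translation can be sketched as follows. This is an illustrative example rather than any particular program's method: the standard genetic code is assumed, and the frame labels (+1 to +3, -1 to -3) are a naming choice made here.

```python
from itertools import product

# Standard genetic code in TCAG order (assumed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the translation of all six reading frames, keyed '+1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGAGTACCGCTAAATTAGTTAAATCAAAA")
for label, protein in frames.items():
    print(label, protein)
```

Frames without stop codons over a long stretch are candidate open reading frames, which is the idea behind an ORF map.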

21
ORF maps
22
The Problem of Frameshift Errors
Actual sequence
        10        20        30        40        50        60        70
ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
M S T A K L V K S K A T N L L Y T R N D V S D S E K
V P L N L N Q K R P I C F I P A T M S P T A R K
E Y R I S I K S D Q S A L Y P Q R C L R Q R E K
Frameshifted sequence after single base error
        10        20        30        40        50        60        70
ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
V P L N L N Q K A T N L L Y T R N D V S D S E K
E Y R I S I K K R P I C F I P A T M S P T A R K
Markov Models (GLIMMER) now commonly used to predict coding regions
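
The effect of the single-base error on this slide can be reproduced in code. The sketch below uses the slide's own sequence and assumes the standard genetic code; the extra A is inserted after base 30 and the final base is dropped so that both sequences have the same length, as on the slide.

```python
from itertools import product

# Standard genetic code in TCAG order (assumed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


actual = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTAT"
          "ACCCGCAACGATGTCTCCGACAGCGAGAAA")

# Simulate the sequencing error: one extra A after base 30.
shifted = actual[:30] + "A" + actual[30:-1]

for name, seq in (("actual", actual), ("shifted", shifted)):
    for offset in range(3):
        print(f"{name:8s} frame {offset + 1}: {translate(seq[offset:])}")
```

In the shifted sequence the first ten residues of frame 1 are unchanged, everything after them is wrong, and the true downstream protein sequence reappears in frame 2.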
23
Analysis of Protein Sequence Data
24
Analysis of Protein Sequence Data: Signal peptides
SignalP uses neural networks
25
Homology
  • Similarity that arises because of descent from a
    common ancestor
  • "The formation of different languages and of
    distinct species, and the proofs that both have
    been developed through a gradual process, are
    curiously parallel. We find in distinct languages
    striking homologies due to community of descent,
    and analogies due to a similar process of
    formation. Languages, like organic beings, can be
    classed in groups under groups; and they can be
    classed either naturally according to descent, or
    artificially by other characters. The survival or
    preservation of certain favoured words in the
    struggle for existence is natural selection."
  • Charles Darwin, 1871, The Descent of Man, Chapter
    3

26
Homology
the cat sat on the mat
die Katze sass auf der Matte
vgeGBant88-2   ITLITCVSVKDNSKRYVVAG
vgeGEfae9-178  LTLITCDQATKTTGRIIVIA
vgeGSpne1-403  MTLITCDPIPTFNKRLLVNF
sortase_staur  LTLITCDDYNEKTGVWEKRK
27
Homology
Sequence homology is not just sequence similarity!
Sequences 1, 1A, 1B and 2 are all homologous to one another.
Another sequence 2 is similar to sequences 1, 1A, 1B and 2, but not
homologous to them, as it does not share a common ancestor with them.
Another sequence 1 is neither homologous nor similar to sequences 1, 1A,
1B and 2.
28
Types of Homology
29
Sequence Databases
  • All sequences when published are deposited in
    Sequence Databases
  • Nucleic Acid Sequence Databases
  • EMBL, Heidelberg, http://www.embl-heidelberg.de/
  • GenBank, at the NCBI, USA, http://www.ncbi.nlm.nih.gov/
  • Protein Sequence Databases
  • GenPept and TREMBL
  • Curated database: SwissProt, Geneva,
    http://www.expasy.ch/sprot/
  • Numerous others, reviewed every year in NAR
  • Problem of sequence formats
  • Simplest format is FASTA
  • >sequence name
  • AATGATGCGTGATGATGATGATGACTGACTGATGATGAT
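
FASTA is simple enough to read with a few lines of code. The reader below is a minimal sketch (the function name is illustrative): a line starting with ">" begins a new record, and all following lines up to the next ">" are joined to form the sequence.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []   # header line, minus the '>'
        elif name is not None:
            chunks.append(line)           # sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)


example = io.StringIO(
    ">sequence name\n"
    "AATGATGCGTGATGATGATG\n"
    "ATGACTGACTGATGATGAT\n"
)
records = list(read_fasta(example))
print(records)
```

The same function works on a file opened with `open(path)`, since it only needs an iterable of lines.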

30
Homology Searches
  • The aim of homology searches is to identify
    sequences within these databases that are
    homologous to your sequence.
  • This involves comparing your sequence with all
    the database sequences, looking for stretches of
    sequence that appear to be similar, then scoring
    the matches and ranking them. Usually a measure
    of the significance of the match will be given.
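
The compare, score and rank idea can be shown with a toy example. This sketch is deliberately much simpler than a real search tool: it counts identical positions in ungapped, equal-length sequences, with no substitution matrix, gaps or significance estimate. The sequences are the sortase fragments shown on the earlier Homology slide; the function names are illustrative.

```python
def identity(a, b):
    """Number of identical positions in two equal-length, ungapped sequences."""
    return sum(x == y for x, y in zip(a, b))


def rank_hits(query, database):
    """Score every database sequence against the query; best match first."""
    scored = [(name, identity(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


database = {
    "vgeGBant88-2": "ITLITCVSVKDNSKRYVVAG",
    "vgeGEfae9-178": "LTLITCDQATKTTGRIIVIA",
    "vgeGSpne1-403": "MTLITCDPIPTFNKRLLVNF",
}
query = "LTLITCDDYNEKTGVWEKRK"  # sortase_staur

ranking = rank_hits(query, database)
for name, score in ranking:
    print(f"{name:15s} {score}/{len(query)} identical")
```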

31
Homology Searches: Translate first!
32
Homology Searches with BLAST
  • BLASTN
  • Nucleotide query vs nucleotide database
  • BLASTP
  • protein query vs protein database
  • BLASTX
  • automatic 6-frame translation of nucleotide query
    vs protein database
  • TBLASTN
  • protein query vs automatic 6-frame translation of
    nucleotide database
  • TBLASTX
  • automatic 6-frame translation of nucleotide query
    vs automatic 6-frame translation of nucleotide
    database
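
The five BLAST programs differ only in the type of the query, the type of the database and whether either is translated. The lookup below restates the list above as code. It is an illustrative helper, not part of any BLAST distribution, and the argument names are assumptions.

```python
def choose_blast(query, database, compare_as_protein=False):
    """Return the BLAST program for a query/database type pair.

    `query` and `database` are "nucleotide" or "protein". For a nucleotide
    query against a nucleotide database, `compare_as_protein=True` selects
    the program that translates both in six frames.
    """
    table = {
        ("nucleotide", "nucleotide"): "TBLASTX" if compare_as_protein else "BLASTN",
        ("protein", "protein"): "BLASTP",
        ("nucleotide", "protein"): "BLASTX",
        ("protein", "nucleotide"): "TBLASTN",
    }
    return table[(query, database)]


print(choose_blast("nucleotide", "protein"))   # translated query vs proteins
print(choose_blast("protein", "nucleotide"))   # protein vs translated database
```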

33
Typical Blast Output
                                                                      Sum
                                                       Reading  High  Probability
Sequences producing High-scoring Segment Pairs:          Frame Score  P(N)      N
emb|X69337|ECDPS    E.coli dps gene for binding protein     +2   834  6.4e-109  1
gb|U04242|ECU04242  Escherichia coli core starvation p...   +3   828  2.7e-106  1
emb|X14180|ECGLNHPQ Escherichia coli glutamine permeas...   +3   443  2.8e-53   1
gb|U18769|HDU18769  Haemophilus ducreyi fine tangled p...   +1   150  4.0e-18   2
dbj|D01016|ANALTI46 Anabaena variabilis lti46 gene. >e...   +2   129  4.8e-12   2
gb|M84990|P26BPO    Plasmid pOP2621 ORF1 gene, 5' end...    -2   131  6.7e-09   1
gb|U16121|HPU16121  Helicobacter pylori neutrophil act...   +1   112  1.8e-06   1
gb|M32401|TRPTYF1   T.pallidum pallidum antigen TyF1 g...   +3   101  5.6e-06   2
emb|X71436|RPNTRB   R.phaseoli ntrB gene                    +1    67  0.76      2
gb|L35598|DRODGC1A  Drosophila melanogaster receptor g...   +1    48  0.97      3
34
Typical Blast Output
>gb|U18769|HDU18769 Haemophilus ducreyi fine tangled pili major pilin subunit gene
        Length = 780
  Plus Strand HSPs:
 Score = 150 (68.0 bits), Expect = 4.0e-18, Sum P(2) = 4.0e-18
 Identities = 36/89 (40%), Positives = 46/89 (51%), Frame = +1
Query:    30 ELLNRQVIQFIDLSLITKQAHWNMRGANFIAVHEMLDGFRTALIDHLDTMAERAVQLGGV 89
             E L         L LI K AHWN  G  FIAVHEMLD       D  D  AER   LG
Sbjct:   253 EALQMRLQGLNELALILKHAHWNVVGPQFIAVHEMLDSQVDEVRDFIDEIAERMATLGVA 432
Query:    90 ALGTTQVINSKTPLKSYPLDIHNVQDHLK 118
               G             YPL     QDHLK
Sbjct:   433 PNGLSGNLVETRQSPEYPLGRATAQDHLK 519
35
Sequence alignments
Dps_trepo  .........N MCTDGKKYHS TATSAAVGAS APGVPDARAI AAICEQLRRH
Dps_helico .......... .......... .......... .......... MKTFEILKHL
Dps_anab   .......... .......... .......MPR INIGLTDEQR QGVINLLNQD
MrgA       .......... .......... .......... MKTENAKTNQ TLVENSLNTQ
Dps_haemo  MRSKTITFPV LKLTGQSQAL TNDMHKNADH TVPGLTVATG HLIAEALQMR
Dps_ecoli  .......... ....MSTAKL VKSKATNLLY TRNDVSDSEK KATVELLNRQ
Dps_strep  ........MT SQPHLHQHAA EIQEFGTVTQ LPIALSHDAR QYSCQRLNRV

Dps_trepo  VADLGVLYIK LHNYHWHIYG IEFKQVHELL EEYYVSVTEA FDTIAERLLQ
Dps_helico QADAIVLFMK VHNFHWNVKG TDFFNVHKAT EEIYEEFADM FDDLAERLVQ
Dps_anab   LADSYLLLVK TKKYHWDVVG PQFRSLHQLW EEHYEKLTEN IDAIAERVRT
MrgA       LSNWFLLYSK LHRFHWYVKG PHFFTLHEKF EELYD..... .HAAETWIPS
Dps_haemo  LQGLNELALI LKHAHWNVVG PQFIAVHEML DSQVDEVRDF IDEIAERMAT
Dps_ecoli  VIQFIDLSLI TKQAHWNMRG ANFIAVHEML DGFRTALIDH LDTMAERAVQ
Dps_strep  LADTQFLYAL YKKCHWGMRG PTAYQLHLLF DKHAQEQLEL VDALAERVQT
  • ClustalW most commonly used program
  • Note problems of indels and ragged ends
  • Need for manual refinement
  • Multiple alignments useful for identifying active
    sites and distant homology
36
Into the twilight zone: The search for distant
homologies
  • Proteins consist of domains
  • Transitivity of Homology
  • Distant Homology
(Figure: proteins A, B, C and D, with signal peptide and coiled coil domains labelled)
37
Domain Hunting
Homology Search
Add to Alignment
38
PSI-BLAST: Position-Specific Iterated BLAST
  • combines statistically significant alignments
    produced by BLAST into a position-specific score
    matrix
  • searches the database using this matrix
  • allows multiple iterations of this process
  • runs at approximately the same speed per
    iteration as gapped BLAST
  • is much more sensitive to weak but biologically
    relevant sequence similarities
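
The idea of a position-specific score matrix can be sketched as below. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency for the 20 amino acids and a flat pseudocount, whereas PSI-BLAST weights sequences and uses observed background frequencies. The aligned motif is taken from the sortase fragments elsewhere in the slides; function names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds matrix: one dict of residue scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # uniform background (assumption)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_window(pssm, sequence):
    """Slide the matrix along a sequence; return (best_score, start)."""
    width = len(pssm)
    return max((score(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Aligned motif from four sortase-like proteins (first six columns).
aligned = ["ITLITC", "LTLITC", "MTLITC", "LTLITC"]
pssm = build_pssm(aligned)

print(round(score(pssm, "LTLITC"), 2))
print(best_window(pssm, "AAAAQVTLITCTPYG"))
```

A sequence can score well against the matrix even when its first residue (V here) was never seen in that column, which is why profile searches find weaker similarities than a single-sequence query.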

39
Role of Sequence Analysis in the Pre-Genomic Era
Identify & Clone Gene
Obtain Sequence
  • Sequence Analysis
  • Homology
  • Structural Features

Confirm
40
Role of Sequence Analysis in the Post-Genomic Era
Obtain Sequence from Genome Project
Formulate Hypothesis
  • Sequence Analysis
  • Homology
  • Structural Features
  • Genomic Context

Novel Lab-based Experimental Programme: amplify,
clone and express gene, create mutant, etc.
41
(No Transcript)
42
Case studies
  • ViruloGenome has resulted in four publications on
  • Sortase
  • ARTs
  • Tricorn proteases
  • The ESAT-6 superfamily
  • and has been cited in one other paper on
    Bestrophin

43
Sortase and its targets
  • Many proteins in Staphylococcus aureus sorted on
    to bacterial surface by an enzyme, sortase
  • Many of these sortase-sorted proteins involved
    in causing disease (wound infection, septicaemia
    etc.)
  • But unclear how common this mechanism of sorting
    proteins on to bacterial surfaces is in other
    similar Gram-positive bacteria
  • Novel sortase-sorted proteins in S. aureus and
    other bacteria could also be involved in
    disease: a route to novel vaccine, drug and
    epidemiological targets

44
Searching new genomes
  • Multiple sortase-like proteins in almost all
    Gram-positives
  • examples include bacteria causing pneumonia,
    diarrhoea, botulism, anthrax, diphtheria
  • all that have one have at least two
  • up to six in C. diphtheriae
  • genes clustered with those for sortase-sorted
    proteins
  • Sortase in a sludge-dwelling archaeon,
    Methanobacterium!
  • Sortase in a Gram-negative bacterium, Shewanella,
    with a sortase-sorted protein!
  • New sortase-sorted proteins also found in many
    different kinds of bacteria
  • Eight new proteins in S. aureus
  • Four new proteins in S. epidermidis

45
vgeGSpyM35-21   202  ----EVTLVTCT-DIEATERIIVKG
vgeGSmut53-76   199  ----EVTLVTCT-DAGATARTIVHG
vgeGEfae12-38   194  ----MITLITCG-DLQATTRIAVQG
vgeGSpne1-402   180  ----YATLLTCTPYMINSHRLLVRG
vgeGSpne1-400   212  ----YVTLLTCTPYMINTHRLLVRG
vgeGCdip26-60   233  ----LVTLVTCTPLGINTHRILVTA
3036999_acnae   193  ----LLTLVTCTPLGINTHRILLTG
vgeGCdip28-52   224  ----YITLITCTPYGINTHRLMVRG
vgeGCdip28-55   243  ----QVTLITCTPYGINTHRLIITA
6580649_stcoe   191  ----ELRVITCG------GGFTKQN
6759575_stcoe   180  ----ELRVITCG------GPYSRST
7160127_stcoe   173  ----EVRLITCA------GDYDHKV
vgeGCdip22-146  216  ----QITLITCTPYAVNSHRLLVRA
vgeGCdip26-61   245  ----LITLVTCTPYGINTHRLLVTA
vgeGEfae24-45   209  ----LVTLLTCTPYMINSHRLLVRG
vgeGSpyo1-123   176  ----YATLVTCTPYGVNTKRLLVRG
vgeGSequ392-28  190  ----LVTLVTCTPYGVNTHRLLVRG
7211004_stcoe   177  ----ELRLITCG------GPRDGQE
6689174_stcoe   207  ---SELRLITCG------GTFDQTT
ywpE_bacsu       73  -----ITLITCDKAVKTEGRLVVKG
vgeGBant88-2    205  -----ITLITCVSVKDNSKRYVVAG
vgeGEfae9-178   220  -----LTLITCDQATKTTGRIIVIA
vgeGSpne1-403   200  ----IMTLITCDPIPTFNKRLLVNF
sortase_staur   178  ----QLTLITCDDYNEKTGVWEKRK
YHCS_BACSU      171  ----ELILTTCYP-------FSYVG
5102801_stcoe   210  KAGHYITLTTCTPVYTSRYRYVVWG
2622963_methe   170  ---ARLMLITCYPPGQKKAAWITHC
46
Implications
  • In disease-causing bacteria
  • Sortase-sorted proteins could provide novel
    vaccine targets, could help track spread of
    infections
  • Sortase-like proteins could prove new drug
    targets
  • Lots of new scientific questions
  • In harmless bacteria
  • overlooked biology of model organisms
  • B. subtilis, S. coelicolor, M. thermoautotrophicum

47
ADP-Ribosylation
  • protein modification
  • transfer of ADP-ribosyl group to a protein by an
    ADP-ribosyltransferase (ART)

48
ADP-Ribosyltransferases
  • Many important bacterial exotoxins ADP-ribosylate
    host cell proteins
  • diphtheria toxin, P. aeruginosa exotoxins A and S
  • clostridial toxins
  • EDIN from Staphylococcus aureus
  • Mtx from Bacillus sphaericus
  • VIP2 from Bacillus cereus
  • AB5 toxins: cholera toxin, LT-I and LT-II from E.
    coli, pertussis toxin

49
Search for novel ARTs
  • Found nearly two dozen novel ARTs
  • similarities patchy, but convincing
  • Many could just be house-keeping proteins
  • but at least two appear to be new toxins

50
NARG_HUMAN 174  CHQVFRGVH  16  GFASASLKHVAAQ  24  PGEEEVLIPP  82
9945866    317  VVKTFRGTQ  17  GYLSTSRDPGVAR  25  GDEQEILYDK  67
ALT_BPT2   476  GITVYRAQS  18  NFVSTSLTPIIFG  69  ATEAEMLFPP  103
ART1_PSESM 182  SGQLHRGIK  21  AFMSTSTRMDVTE  24  PYEDEALIPP  29
ART3_PSESM 193  SSQLHRGIK  21  AFMSTSTHMQVSE  24  PYEDEALISP  29
ART4_PSESM 159  RGTVYRGIR  21  AFMSTTRIKDSAQ  28  PSEEEIMLPM  17
ART5_PSESM  70  KETLYRGIN  21  TFVSATPDLSTVN  30  SAEEEGIFAP  33f
ART2_PSESM 152  TTCLYRPIN  60  AFMSTSTDSVIAN  24  AYEKEAIIPP  30f
ART1_ENTFA 389  DIITYRGVS  14  EFKSTSINKKVAE  34  KKESEFLLNR  21
ART2_ENTFA 337  SDDLVRFAN  11  EYISTTKDIYSEQ  18  NSEKEVLFNR  25
ART_STREPY 152  LTSYYRNHQ  18  SFMSTTALKNGAM  26  PSEVELLFPR  28
EDIN_STAAU 150  DENLVRKLN  11  GYSSTQLVSGAAV  28  YGQQEVLLPR  40
ART_MYCAV   90  EGAVVRGTN  18  AFTSTSTDHTVAQ  26  PDEKEILFPA  16
6730537    343  NITVYRWCG  30  GYMSTSLSSERLA  28  ASEKEILLDK  42
ART_CLOAB  275  DMKVYRGTD  29  GFMSTALVKESSF  25  PDEAELLLNH  27
10176155   382  GIMVYRNVG  22  GYMSTSVLREGAF  26  FKDEEYEFLI  27
10336588   152  PISVYRIVS  18  GFISSGTDKTQML  31  GRTLEALFPP  42
VRP2_SALTY 466  HRVVYRGLK  23  AFMSTSPDKAWIN  23  KGEAEMLFPP  61
TOX1_BORPE   3  PATVYRYDS  36  AFVSTSSSRRYTE  63  TYQSEYLAHR  101
YD72_MYCPN   5  VRFVYRVDL  32  YFISTSETPTAAI  59  AYQREWFTDG  454
TOXA_SALTI   1  VDFVYRVDS  37  YIATTSSVNQTYA  51  RLQREYVSTL  104
TOX2_SALTI   1  VDFVYRVDS  37  YIATISDINEAYN  51  RLQSEYVSVN  52f
5902922     65  RQRLVRWDR  39  IFVSTTKTQRNKK  38  PNQMEVAFPG  680
7105990     76  CGTLYRSDS  31  PYVSTSYDHDLYK  54  QTKTEIMSDC  11
143205      92  EHRLLRWDR  38  IFVSTTRA-RYNN  41  PNEDEITFPG  668
CHTA_VIBCH   1  DDKLYRADS  47  GYVSTSISLRSAH  37  PDEQEVSALG  75
ELAH_ECOLI   2  GDKLYRADS  47  GYVSTSLSLRSAH  37  PYEQEVSALG  119
ART2_BURCE  1f  --VLYRPSC  41  GYLGAFRKLDTAT  35  PEDGEVAALG  415
ART1_BURCE 380  EGALYRPSA  41  GYLGTYRSAFTAQ  32  PAQGEVAAMG  401
ART4_BURCE 438  PFALFRPSQ  41  GYVGTFTGRAEAQ  36  PRDGEVAAMG  404
ART3_BURCE 609  RGLLFRPST  41  GYLGTFQYGRTAL  34  PHIHEQAAMG  402
ART5_BURCE 772  PWVLFRPST  41  GYVGAFRLPSTAL  34  PDNQEFAAVG  405
                     R            S                   E
51
ESAT-6
  • Small secreted protein from M. tuberculosis of
    key importance as virulence factor and T-cell
    antigen
  • Missing in BCG, defined mutant attenuated
  • Useful in immunodiagnosis
  • Multi-protein family in M. tuberculosis
  • Hard to investigate in difficult-to-handle
    mycobacteria
  • But said to have no homologues outside
    mycobacteria and their close relatives...

52
The ESAT-6/WXG100 superfamily
53
VGE PSI-BLAST: Search for novel ESAT-6 homologues
54
Bestrophin
  • Best macular dystrophy or vitelliform macular
    dystrophy type-2
  • human genetic disorder
  • visual impairment, egg-like lesions in the macula
  • caused by mutations in a gene for bestrophin,
    function unknown
  • VGE-PSI-BLAST searches by Goodstadt & Ponting
    found homologues of bestrophin in several species
    of bacteria, including the model laboratory
    organism, Escherichia coli (protein b1520)

55
From Sequence to Consequence
  • Preliminary lab-based studies underway by Lihong
    Zhang on sortase substrates in
  • C. diphtheriae
  • B. anthracis
  • C. difficile
  • S. pneumoniae
  • S. putrefaciens

56
From Sequence to Consequence
  • Preliminary lab-based studies planned on
  • ARTs from S. typhi and S. pyogenes
  • Martin Antonio, Claire Spreadbury with Steve
    Dove
  • ESAT-6/YukA homologues in S. aureus and B.
    subtilis
  • Jen Gifford-Garner, Lihong Zhang, Kate Spanchak;
    mutants from Anil Wipat, bacterial 2H from Lars
    Westblade
  • Bestrophin homologue in E. coli
  • Lihong Zhang; mutagenesis with Jeff Cole

57
Conclusions
Only connect!
58
Acknowledgements
  • Dosh
  • BBSRC,
  • QUB, Brum
  • Meatware
  • Alex Lam
  • Nick Loman
  • Arshad Kahn
  • Lihong Zhang
  • Sequences
  • provided by Sanger, TIGR, et al. , funded by
    Wellcome et al.