Title: Difference Between Expected and Observed frequencies
1GENOME SIGNATURES OF MICROBIAL ORGANISMS IDENTIFIED BY AMINO ACID N-GRAM ANALYSIS
B. Suman Bharathi
Advisor: Judith Klein-Seetharaman
Forschungszentrum, Juelich, Germany
2Genome Signatures
- Sequence peptides that occur with unusually high frequency in a particular organism or pathogen, unlike in others
- Potential applications
- Drug development: synthesize drugs which target the genome signature in the pathogen
- Sensor development: use the genome signature to identify the organism quickly using an antibody
3Approach
- Linguistic approach
- N-gram analysis using toolkit
- What the BLMT toolkit provides
- N-gram statistical analysis
- Definition of signature sequences
- Use of toolkit on Neisseria meningitidis
Figure: Neisseria meningitidis versus other species, n = 4. Y-axis: occurrence of n-gram (0 to 0.09). X-axis: n-gram (sequence of length n), in plotted order: AAAL, SDGI, LAAA, ALAA, LAAL, AALA, AALL, LLAA, ALLA, AVLA, AAAA, MPSE, AVAA, AAAV, GRLK, EAAA, AEAA, AAEA, AAVA, AAAE.
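The n-gram counting that this approach relies on can be sketched in a few lines. This is only an illustration, not the BLMT toolkit itself: the function name `ngram_frequencies` and the toy sequence are assumptions made for the example, and overlapping n-grams are assumed.

```python
from collections import Counter

def ngram_frequencies(sequence, n=4):
    """Relative frequency of each overlapping n-gram in one amino acid sequence.

    Hypothetical helper for illustration; not part of BLMT.
    """
    seq = sequence.upper()
    # Slide a window of width n over the sequence, one residue at a time.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}

# Toy sequence, not real proteome data.
freqs = ngram_frequencies("AAALAAAL", n=4)
for gram, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(gram, round(freq, 3))
```

For a whole organism, the same counting would be run over every protein sequence and the counts pooled before normalizing.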
4Use of BLMT
- N-gram statistical analysis gives detailed statistical data: the frequency of each n-gram and its respective mean and standard deviation.
- We have taken 45 organisms into consideration: bacteria, archaea, mycoplasmas and human.
- Search for n-grams whose frequencies lie many standard deviations away from the mean values.
- This indicates the difference between expected and observed values in the frequency of the n-grams.
- Eventually helps us to see how unusual this n-gram is in the organism compared with the others.
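One way to read "standard deviations away from the mean" is sketched below. It assumes that, for each n-gram, the mean and sample standard deviation are taken over the other organisms and the target organism's frequency is then expressed in those units. The slides do not state the exact definition BLMT uses, so the function name `deviations_from_mean`, the choice of background, and the toy numbers are all assumptions.

```python
from statistics import mean, stdev

def deviations_from_mean(freq_by_organism, target):
    """Standard deviations separating each n-gram frequency in `target`
    from the mean frequency of that n-gram in the other organisms.

    Assumption: the background is every organism except `target`.
    """
    others = [org for org in freq_by_organism if org != target]
    if len(others) < 2:
        raise ValueError("need at least two comparison organisms")
    scores = {}
    for gram, observed in freq_by_organism[target].items():
        # An n-gram absent from an organism counts as frequency 0.
        background = [freq_by_organism[org].get(gram, 0.0) for org in others]
        sigma = stdev(background)
        if sigma > 0:  # skip n-grams with no spread in the background
            scores[gram] = (observed - mean(background)) / sigma
    return scores

# Toy frequencies, not measured data.
toy = {
    "org_a": {"AAAL": 0.08, "SDGI": 0.01},
    "org_b": {"AAAL": 0.02, "SDGI": 0.01},
    "org_c": {"AAAL": 0.03, "SDGI": 0.01},
    "org_d": {"AAAL": 0.04, "SDGI": 0.01},
}
scores = deviations_from_mean(toy, "org_a")
print(scores)
```

Leaving the target out of the background matters: if it were included, the score could never exceed the square root of the number of other organisms, which would rule out very large values.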
5Difference Between Expected and Observed
frequencies
Xylella(black) Vibrio(red) Ureaplasma(green) Treponema(blue) Thermotoga(yellow)
X-axis: n-gram
The positive values indicate the over-represented n-grams, while the negative values indicate the under-represented n-grams.
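The difference between observed and expected frequency can be illustrated as follows. The slides do not say how the expected frequency is obtained, so this sketch assumes a simple independence model in which the expected probability of an n-gram is the product of the frequencies of its individual amino acids. The function name `observed_minus_expected` and the toy sequence are hypothetical.

```python
from collections import Counter
from math import prod

def observed_minus_expected(sequence, n=4):
    """Observed n-gram frequency minus the frequency expected if residues
    were independent (an assumed null model, not confirmed by the source)."""
    seq = sequence.upper()
    residue_freq = {aa: count / len(seq) for aa, count in Counter(seq).items()}
    grams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(grams.values())
    diffs = {}
    for gram, count in grams.items():
        expected = prod(residue_freq[aa] for aa in gram)
        # Positive: over-represented. Negative: under-represented.
        diffs[gram] = count / total - expected
    return diffs

# Toy sequence, not real proteome data.
diffs = observed_minus_expected("AAAAAAAL", n=4)
print(diffs)
```

Only n-grams that actually occur are reported here; n-grams that never occur would all have negative differences under this model.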
6Initial Points of difference between expected and
observed frequency graph
Xylella(black) Vibrio(red) Ureaplasma(green) Treponema(blue) Thermotoga(yellow)
Ureaplasma shows high difference values (approx. 0.00021), indicating over-representation of n-grams compared to the expected probability of occurrence in the organism.
7Standard deviation away from the mean
- Mycoplasma genitalium(black)
- M.tuberculosis(red)
- M.leprae(green)
- Mesorhizobium(blue)
- Lactococcus(yellow)
Shows the distribution of n-gram standard deviations, with both high and low difference values, indicating the over-represented and under-represented n-grams.
8Highest standard deviations away from the mean
- Mycoplasma genitalium(black)
- M.tuberculosis(red)
- M.leprae(green)
- Mesorhizobium(blue)
- Lactococcus(yellow)
Shows the initial (highest) values of standard deviations away from the mean. N-grams of M.tuberculosis have much higher values than those of M.leprae.
9Comparison of genome size with varying standard
deviations
- Examine the relationship between genome size and distribution of n-gram standard deviations for each organism
- Human genome taken as reference.
- Compare genome size and standard deviations within same genus but across different species.
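A rank correlation is one simple way to examine such a relationship numerically. The sketch below computes Spearman's rank correlation with the standard library only. Using this statistic is an assumption of the example, not something the slides report, and the function names and toy values are hypothetical.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Toy values only: sizes and a per-organism spread measure (e.g. the
# largest deviation seen in each organism). Not data from the study.
toy_sizes = [100, 200, 300, 400]
toy_spread = [4.0, 3.0, 2.0, 1.0]
rho = spearman(toy_sizes, toy_spread)
print(rho)
```

A value near zero on real data would agree with there being no specific relationship between size and spread; values near +1 or -1 would indicate a monotonic trend.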
10Size Distribution of Genomes
1. Human 22889476
2. Bacteria_Mesorhizobium_loti 4080256
3. Bacteria_Pseudomonas_aeruginosaPA01 3730192
4. Bacteria_Escherichia_coliO157H7 3229098
5. Bacteria_Escherichia_coliO157H7EDL933 3228100
6. Bacteria_Escherichia_coliK12 2726558
7. Bacteria_Mycobacterium_tuberculosisH37Rv 2666338
8. Bacteria_Bacillus_subtilis 2442200
9. Bacteria_Bacillus_halodurans_C125 2384352
10. Bacteria_SynechocystisPCC6803 2072748
11. Bacteria_Vibrio_cholerae_chr1 1725852
12. Bacteria_Deinococcus_radioduransR1_chr1 1559376
13. Bacteria_Xylella_fastidiosa 1490262
14. Archaea_Archaeoglobus_fulgidus 1343990
15. Bacteria_Pasteurella_multocida 1340102
16. Bacteria_Lactococcus_lactis_subsp_lactis 1335222
17. Archaea_Aeropyrum_pernix 1280062
18. B_Neisseria_meningitidis_serogroupBstrainMC58 1178096
19. Archaea_Halobacterium_spNRC1 1178038
20. B_Neisseria_meningitidis_serogroupAstrainZ2491 1176104
21. Bacteria_thermotoga_maritima 1167344
22. Bacteria_Pyrococcus_horikoshiiOT3 1141216
23. Bacteria_Mycobacterium_leprae_strainTN 1080756
24. A_Methanobacterium_thermoautotrophicum_deltaH 1054752
25. Bacteria_Haemophilus_influenzaeRd 1045572
26. Bacteria_Campylobacter_jejuni 1020944
27. Bacteria_Helicobacter_pylori_strainJ99 990942
28. Bacteria_Helicobacter_pylori26695 986258
29. Archaea_Methanococcus_jannaschii 970558
30. Bacteria_Aquifex_aeolicus 968068
31. Archaea_Thermoplasma_acidophilum 909164
32. Archaea_thermoplasma_volcanium 903228
33. Bacteria_Chlamydophila_pneumoniaeJ138 735350
34. Bacteria_Chlamydophila_pneumoniaeCWL029 725492
35. Bacteria_Chlamydophila_pneumoniaeAR39 729896
36. Bacteria_Treponema_pallidum 703414
37. Bacteria_Chlamydia_muridarum 646712
38. Bacteria_Chlamydia_trachomatis 626142
39. Bacteria_Rickettsia_prowazekii_strain_MadridE 559828
40. Bacteria_Mycoplasma_pneumoniae 480870
41. Bacteria_Ureaplasma_urealyticum 457608
42. Bacteria_Buchnera_sp_APS 371470
43. mycoplasma genitalium 352826
44. Bacteria_Borrelia_burgdorferi 300106
11Size genome graph and varying std deviation values
- Human(black,22889476)
- Mesorhizobium(red,4080256)
- P.aeruginosa(green,3730192)
- E_coli0157h7(blue,3229098)
- E_coli0157h7EDl933(yellow,3228100)
The organisms are listed in descending order of
genome size. The relation between distribution of
n-gram standard deviations and size is compared.
12Tail end of Genome size and n-gram distribution
of standard deviations
Human(black,22889476) Mesorhizobium(red,4080256) P.aeruginosa(green,3730192) E_coli0157h7(blue,3229098) E_coli0157h7EDl933(yellow,3228100)
The human genome, though largest in size, has low n-gram standard deviation values away from the mean compared to smaller genomes.
13Initial points Genome size and n-gram
distribution of standard deviations
Human(black,22889476) Mesorhizobium(red,4080256) P.aeruginosa(green,3730192) E_coli0157h7(blue,3229098) E_coli0157h7EDl933(yellow,3228100)
Human n-gram standard deviation values are almost equal to those of Mesorhizobium, though Mesorhizobium has a much smaller genome.
14Genome size and n-gram distribution of standard
deviations
- Human (black,22889476)
- E_coliK12(red,2726558)
- M.tuberculosis(green,2666338)
- B.subtilis(blue,2442200)
- B.halodurans(yellow,2384352)
- Synechocystis(brown,2072748)
M.tuberculosis has very high n-gram standard deviation values. They exceed those of human, despite its smaller genome size.
15Initial points of Genome size and n-gram
distribution of standard deviations
Human(black,22889476) E_coliK12(red,2726558) M.tuberculosis(green,2666338) B.subtilis(blue,2442200) B.halodurans(yellow,2384352) Synechocystis(brown,2072748)
The thickness of the lines indicates the genome size. The thinnest line represents E_coliK12. Mycobacterium tuberculosis shows the highest values.
16Final points of Genome size and n-gram
distribution of standard deviations
Human(black,22889476) E_coliK12(red,2726558) M.tuberculosis(green,2666338) B.subtilis(blue,2442200) B.halodurans(yellow,2384352) Synechocystis(brown,2072748)
M.tuberculosis and all other organisms here have
n-grams with higher difference values than human.
17Same genus / different species
- 4-grams in M. tuberculosis lie many more standard deviations from the mean than those in M. leprae
18Mycobacterium
M. tuberculosis
M. leprae
19Other Organisms
Neisseria meningitidis
Thermotoga maritima
Synechocystis spec.
Haemophilus influenzae
Human
20Conclusions
- N-grams which are at least 30 standard deviations away from the mean are significant candidates for genome signatures.
- Difference graphs: estimate the likelihood of an n-gram being observed in an organism.
- Genome size graphs: there is no specific relationship between the size of a genome and its standard deviation values.
- Same genus and different species, where genome size is specified: there is a noticeable difference between the Mycobacterium species (M.leprae and M.tuberculosis).
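The 30-standard-deviation cutoff from the first conclusion can be applied as a simple filter. This sketch assumes the deviations have already been computed per n-gram and keeps only over-represented n-grams, since signatures are described as occurring with unusually high frequency. The function name `signature_candidates` and the example scores are hypothetical placeholders, not results from the study.

```python
def signature_candidates(deviation_by_ngram, threshold=30.0):
    """N-grams at least `threshold` standard deviations above the mean,
    strongest first. One-sided by assumption (over-representation only)."""
    hits = [(gram, score) for gram, score in deviation_by_ngram.items()
            if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Placeholder scores for illustration only.
example_scores = {"AAAL": 42.0, "SDGI": 31.5, "GRLK": 12.0, "MPSE": -35.0}
candidates = signature_candidates(example_scores)
print(candidates)
```

Replacing the comparison with `abs(score) >= threshold` would also report strongly under-represented n-grams, if those are of interest.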
21Current and future work
- Find n-gram signatures in E.coli.
- Explore the relationship between genome size and distribution of n-gram standard deviations for different species of the same genus.
- Find more specific targets to differentiate species in terms of signature peptides for all the 44 organisms taken for study.