Stanislav Luban1,2 - PowerPoint PPT Presentation

Provided by: stasl
Learn more at: https://kiharalab.org

1
Comparative Study of Small RNA and Small Peptides
in Complete Genome Sequences
  • Stanislav Luban (1, 2)
  • Daisuke Kihara (2, 1)
  • 1. Department of Computer Sciences
  • 2. Department of Biological Sciences
  • Purdue University, West Lafayette, IN

2
Introduction: Structural Small RNA (sRNA)
  • Genes which produce non-coding transcripts that
    function directly as structural, regulatory, or
    catalytic RNAs
  • Include rRNAs, tRNAs, small nucleolar RNAs,
    spliceosomal RNAs, viral associated RNAs,
    microRNAs, ctRNAs, and others
  • The Rfam (RNA families) database stores 34496 sRNA
    entries distributed among 352 known families
  • In E. coli, about 50 sRNAs are known

(Figure from the Rfam database, http://www.sanger.ac.uk/Software/Rfam/)
3
Methods: QRNA
  • Models the distinctive pattern of mutation in each type of
    conserved region
  • Conserved Structural RNA
      • Pattern of compensatory mutations consistent with
        base-paired secondary structure
      • Pair Stochastic Context-Free Grammar Model
  • Conserved Coding Region
      • Pattern of synonymous codon substitutions
      • Pair Hidden Markov Model
  • Other Types of Conserved Regions
      • Approximated by the null hypothesis that mutations
        occur position-independently, without pattern
      • Pair Hidden Markov Model
  • Scores are log likelihoods used to calculate the
    final log-odds score for the RNA model compared to
    the other two models

(Figure: Rivas et al., Current Biol. 2001)
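The scoring idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the three inputs are taken to be log2-likelihoods of one alignment under the RNA, coding (COD), and null (OTH) models, and the function name and return shape are hypothetical. It is not QRNA's actual implementation or output format.

```python
def classify_alignment(ll_rna, ll_cod, ll_oth):
    """Pick the best-scoring model and report the RNA log-odds vs. the null.

    Inputs are assumed to be log2-likelihoods of the same pairwise
    alignment under the three models (hypothetical interface).
    """
    scores = {"RNA": ll_rna, "COD": ll_cod, "OTH": ll_oth}
    # The model with the highest log-likelihood wins the classification.
    winner = max(scores, key=scores.get)
    # A log-odds score is a difference of log-likelihoods: RNA vs. null.
    log_odds_rna = ll_rna - ll_oth
    return winner, log_odds_rna


winner, score = classify_alignment(-100.0, -110.0, -105.0)
print(winner, score)
```

A positive log-odds value means the alignment is better explained by the RNA model than by the position-independent null model.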
4
Procedure for Extracting sRNAs
  • Extract Intergenic Regions From 30 Sequenced Genomes
  • Perform All Vs. All Nucleotide-Nucleotide BLAST
  • Select Significant Alignments, Concatenate and
    Format into QRNA Program Input
  • Run QRNA, Extract Alignments Scoring as sRNAs vs.
    Coding and Null Hypothesis Regions
  • Eliminate Alignment Regions Which Overlap >50%
    with E. coli Regulatory Regions
  • Extend Regions Within 25 nt Of Other Regions,
    Causing Them To Include Each Other
  • Merge sRNA Regions Which Align or Exactly
    Overlap Into Families
  • Eliminate Family Regions Not Found Using Both
    Query And Database Organism As Source
  • Verify Results Computationally And Experimentally
    (Yet To Be Done)
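Two of the steps above, extending regions that lie within 25 nt of each other and filtering by overlap with regulatory regions, are interval operations. The sketch below is one possible reading of those steps, not the authors' code: regions are assumed to be `(start, end)` tuples on one strand, and `merge_nearby` and `overlap_fraction` are hypothetical helper names.

```python
def merge_nearby(regions, max_gap=25):
    """Merge (start, end) regions that overlap or lie within max_gap nt."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it to include this one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_fraction(region, other):
    """Fraction of `region` that is covered by `other`."""
    start, end = region
    length = end - start
    if length <= 0:
        return 0.0
    overlap = min(end, other[1]) - max(start, other[0])
    return max(0, overlap) / length


regions = merge_nearby([(100, 150), (170, 200), (300, 320)])
# Keep regions that overlap a (made-up) regulatory region by at most 50%.
regulatory = (290, 400)
kept = [r for r in regions if overlap_fraction(r, regulatory) <= 0.5]
print(regions, kept)
```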
5
Genome Data Set
  • 30 Microbial Genomes Used as Queries and
    Databases
  • Gammaproteobacteria
      • Acinetobacter calcoaceticus
      • Blochmannia floridanus
      • Buchnera aphidicola
      • Coxiella burnetii
      • Erwinia carotovora
      • Escherichia coli
      • Haemophilus ducreyi
      • Haemophilus influenzae
      • Pasteurella multocida
      • Photorhabdus luminescens
      • Pseudomonas aeruginosa
      • Pseudomonas putida
      • Pseudomonas syringae
      • Salmonella enterica
      • Salmonella typhimurium
  • Alphaproteobacteria
      • Agrobacterium tumefaciens
      • Brucella melitensis
      • Caulobacter crescentus
      • Mesorhizobium loti
  • Deinococci
      • Deinococcus radiodurans

6
Result Statistics
  • Total number of intergenic regions: 94464
  • Average number of intergenic regions per
    organism: 3148.8
  • Total combined length of intergenic regions:
    16663732 nt
  • Average length of intergenic region: 176.4 nt
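The two averages follow directly from the totals on this slide. A short check, using only the numbers quoted above (the variable names are illustrative):

```python
total_regions = 94464        # intergenic regions across all genomes
genomes = 30                 # genomes in the data set
total_length_nt = 16663732   # combined length of all intergenic regions

avg_regions_per_genome = total_regions / genomes
avg_region_length = total_length_nt / total_regions

print(round(avg_regions_per_genome, 1), round(avg_region_length, 1))
```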

7
sRNA Length vs. Score Plot
Total: 29488 sRNAs
8
Number of sRNA Entries by Organism
1 - Pseudomonas putida
2 - Shigella flexneri
3 - Xanthomonas citri
4 - Shewanella oneidensis
5 - Wigglesworthia brevipalpis
6 - Haemophilus ducreyi
7 - Pseudomonas syringae
8 - Erwinia carotovora
9 - Escherichia coli
10 - Vibrio parahaemolyticus
11 - Mesorhizobium loti
12 - Buchnera aphidicola
13 - Brucella melitensis
14 - Yersinia pestis
15 - Xylella fastidiosa
16 - Pseudomonas aeruginosa
17 - Salmonella enterica
18 - Caulobacter crescentus
19 - Agrobacterium tumefaciens
20 - Blochmannia floridanus
21 - Pasteurella multocida
22 - Deinococcus radiodurans
23 - Vibrio cholerae
24 - Photorhabdus luminescens
25 - Coxiella burnetii
26 - Vibrio vulnificus
27 - Salmonella typhimurium
28 - Acinetobacter calcoaceticus
29 - Xanthomonas campestris
30 - Haemophilus influenzae
Total: 29488 sRNAs
9
Conservation of sRNAs
Total: 3768 families
10
Conservation of sRNAs
Along with statistics for all families, statistics
for families containing at least one entry from
E. coli were added for comparison
E. coli total: 554 families
Total: 3768 families
11
Common Organism Combinations in Families
  • Top 5 most frequent combinations of 4 and 7
    organisms
  • Combination: Occurrences
  • Ecoli, Senterica, Sflexneri, Styphimurium: 117
  • Ecarotovora, Ecoli, Senterica, Styphimurium: 26
  • Ecoli, Senterica, Styphimurium, Ypestis: 20
  • Ecarotovora, Ecoli, Sflexneri, Styphimurium: 18
  • Ecoli, Sflexneri, Styphimurium, Ypestis: 17
  • Ecarotovora, Ecoli, Pluminescens, Senterica,
    Sflexneri, Styphimurium, Ypestis: 4
  • Acalcoaceticus, Ccrescentus, Mloti, Paeruginosa,
    Pputida, Psyringae, Xcampestris: 2
  • Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti,
    Pputida, Psyringae, Xcampestris: 2
  • Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti,
    Paeruginosa, Psyringae, Xcampestris: 2
  • Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti,
    Paeruginosa, Pputida, Xcampestris: 2
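A tally like the one above could be produced by counting, for each family, the set of organisms it contains. The sketch below assumes each family is available as a list of organism names (a hypothetical representation; the example families are made up) and is not the authors' script.

```python
from collections import Counter


def count_combinations(families, size):
    """Count how often each organism combination of a given size occurs."""
    counts = Counter(
        tuple(sorted(set(organisms)))   # order-independent, duplicates collapsed
        for organisms in families
        if len(set(organisms)) == size
    )
    return counts.most_common()


families = [
    ["Ecoli", "Senterica", "Sflexneri", "Styphimurium"],
    ["Styphimurium", "Ecoli", "Sflexneri", "Senterica"],
    ["Ecoli", "Ypestis"],
]
print(count_combinations(families, 4))
```

Collapsing each family to a set means an organism contributing several entries (for example, two chromosomes) is counted once per family.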

12
Result Verification
  • 71 total sRNAs related to E. coli and already
    annotated in the Rfam database were used as a
    benchmark
  • Of those:
      • 15 found by the computational method that were
        also listed in Rfam and not tRNAs
      • 6 not found due to shortcomings of the method
      • 29 tRNAs already annotated as gene loci in the
        E. coli genome sequence used
      • 10 E. coli plasmid loci not found in the full
        E. coli genome sequence used
      • 2 4.5S RNAs already annotated as gene loci in
        the E. coli genome sequence used
      • 2 E. coli reverse transcriptase loci not found
        in the full E. coli genome sequence used
      • 1 E. coli insertion sequence not found in the
        full E. coli genome sequence used
      • 1 E. coli small RNA annotated separately, not
        found in the full E. coli genome sequence used
      • 1 antisense RNA already annotated as a gene
        locus in the E. coli genome sequence used
      • 1 cloning vector with E. coli promoter not
        found in the full E. coli genome sequence used
      • 1 E. coli transposable element not found in
        the full E. coli genome sequence used
      • 1 reporter vector not found in the full E. coli
        genome sequence used
      • 1 E. coli retron not found in the full E. coli
        genome sequence used
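The breakdown above can be checked against the stated total of 71. The sketch below only re-adds the slide's own counts; the category labels are shortened, and the final ratio (found versus found plus missed) is an illustrative derived quantity, not a figure reported on the slide.

```python
benchmark = {
    "found": 15,
    "missed (method shortcomings)": 6,
    "tRNAs (already annotated)": 29,
    "plasmid loci": 10,
    "4.5S RNAs (already annotated)": 2,
    "reverse transcriptase loci": 2,
    "insertion sequence": 1,
    "small RNA annotated separately": 1,
    "antisense RNA (already annotated)": 1,
    "cloning vector": 1,
    "transposable element": 1,
    "reporter vector": 1,
    "retron": 1,
}

total = sum(benchmark.values())
# Entries that were neither found nor attributed to the method's shortcomings.
excluded = total - benchmark["found"] - benchmark["missed (method shortcomings)"]
# Illustrative only: share found among entries the method could have reported.
found_share = benchmark["found"] / (total - excluded)
print(total, excluded)
```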

13
Candidates for Experimental Verification of
Findings
  • For the following 2 slides:
  • Family designation is expressed as organism name,
    locus absolute start location, and locus absolute
    end location, and is synonymous with the first
    (header) entry of that family
  • Entries refers to the number of sRNA entries from
    different organisms (2 chromosomes counted
    separately) in the family
  • Length (nt) and score refer only to the header
    entry of the family
  • Scores are calculated by the QRNA program as the
    log-odds post score for RNA likelihood as opposed
    to the null hypothesis
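Since a family designation is just an organism name followed by start and end coordinates, it can be parsed mechanically. A minimal sketch, assuming whitespace-separated fields as shown on the following slides; `FamilyHeader` and `parse_designation` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class FamilyHeader(NamedTuple):
    organism: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # The listed lengths equal end minus start for the slide entries.
        return self.end - self.start


def parse_designation(text: str) -> FamilyHeader:
    """Parse a designation such as 'Ecoli 3941194 3941327'."""
    organism, start, end = text.split()
    return FamilyHeader(organism, int(start), int(end))


header = parse_designation("Ecoli 3941194 3941327")
print(header.organism, header.length)
```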

14
Candidates for Experimental Verification of
Findings
  • Top 10 highest statistically scoring E. coli sRNA
    loci found by computational method
  • Family designation: Ecoli 3941194 3941327
    Length: 133 Score: 34.114
  • Family designation: Ecoli 2744345 2744445
    Length: 100 Score: 29.631
  • Family designation: Ecoli 780875 781068
    Length: 193 Score: 29.194
  • Family designation: Ecoli 2687537 2687689
    Length: 152 Score: 27.734
  • Family designation: Ecoli 2519348 2519548
    Length: 200 Score: 23.876
  • Family designation: Ecoli 4169337 4169400
    Length: 63 Score: 21.625
  • Family designation: Ecoli 4038218 4038281
    Length: 63 Score: 21.596
  • Family designation: Ecoli 2751994 2752022
    Length: 28 Score: 20.893
  • Family designation: Ecoli 3420989 3421058
    Length: 69 Score: 20.821
  • Family designation: Ecoli 3808832 3808858
    Length: 26 Score: 16.995

15
Candidates for Experimental Verification of
Findings
  • Top 10 largest sRNA families found by
    computational method
  • Family designation: Styphimurium 3358766 3358804
    Entries: 18 Length: 38 Score: 4.590
  • Family designation: Ecarotovora 3161909 3161946
    Entries: 15 Length: 37 Score: 12.604
  • Family designation: Ecarotovora 1144121 1144141
    Entries: 12 Length: 20 Score: 5.265
  • Family designation: Styphimurium 3342804 3342899
    Entries: 12 Length: 95 Score: 4.328
  • Family designation: Ecarotovora 2597534 2597593
    Entries: 10 Length: 59 Score: 3.343
  • Family designation: Paeruginosa 2508264 2508282
    Entries: 9 Length: 18 Score: 7.068
  • Family designation: Styphimurium 975191 975219
    Entries: 8 Length: 28 Score: 16.296
  • Family designation: Styphimurium 3746886 3746903
    Entries: 8 Length: 17 Score: 1.146
  • Family designation: Ecarotovora 3477891 3477922
    Entries: 8 Length: 31 Score: 2.697
  • Family designation: Ecarotovora 4490537 4490683
    Entries: 7 Length: 146 Score: 16.753
  • This last entry was used as a sample for detailed
    study and is discussed subsequently.

16
Detailed Study of Located Sample sRNA
  • Hit to Alpha_RBS RNA (Rfam RF00140) (115 nt)

Rfam Sequence: GUCCUUGAUAUUCUGUUUGAGUAUCCUGAAAACGGGCUUUUCAAGAUCAGAAUAUCAAAUUAAUUAAAAUAUAGGAGUGCAUAGUGGCCCGUAUUGCAGGCAUUAACAUUCCUGAU

Organism      Location (in genome)  Length (nt)  Score   Neighboring Genes
Ecarotovora   4490537-4490683       146          16.753  rpsM - rpmJ
Pluminescens  5487752-5487866       114          10.791  rpsM - secY
Ypestis       232330-232476         146          15.757  rpmJ - rpsM
Styphimurium  3585744-3585879       135          41.980  rpsM - rpmJ
Senterica     4243623-4243770       147          40.046  rpmJ - rpsM
Ecoli         3440108-3440255       147          43.556  rpsM - rpmJ
Sflexneri     3426855-3427002       147          41.980  rpsM - rpmJ
17
Detailed Study of Located Sample sRNA
  • Most Likely (Lowest Free Energy) Predicted Fold
    of an 80 nt Segment of the Sequence
  • Mfold (Zuker et al., 2004) Used

18
Another Approach to Finding sRNAs in E. coli:
Paper Summary
19
Method Used in Paper to Find Putative sRNAs
  • A database of all E. coli intergenic DNA
    sequences was created based on gene annotations
    in an early release of the EcoGene database, and
    used as input to a profile search program
    (pftools2.2, Swiss Institute of Bioinformatics)
    set to find sigma-70 promoters
  • A terminator motif was searched for in the
    database using the following criteria: (1) an
    11-nt A-rich region; (2) a variable-length
    hairpin; (3) a variable-length spacer; (4) a 5-nt
    T-rich region nearest the hairpin; and (5) a 7-nt
    distal extra T-rich region
  • Predicted promoter and terminator pairs were
    combined to generate putative sRNAs if (1) the
    pair was on the same strand and (2) the pair was
    more than 45 but less than 350 nt apart
  • To verify, open reading frames and possible
    ribosome binding sites were searched for
    downstream of each promoter
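The pairing rule in the third bullet (same strand, more than 45 but less than 350 nt apart) can be sketched as follows. The site representation as `(position, strand)` tuples, the function name, and the requirement that the terminator lie downstream of the promoter on its strand are assumptions added for illustration; the paper's own implementation is not described on the slide.

```python
def pair_promoters_terminators(promoters, terminators, min_gap=45, max_gap=350):
    """Pair promoter and terminator sites into putative sRNA candidates.

    Each site is a (position, strand) tuple with strand '+' or '-'.
    Returns (promoter_pos, terminator_pos, strand) tuples.
    """
    candidates = []
    for p_pos, p_strand in promoters:
        for t_pos, t_strand in terminators:
            if p_strand != t_strand:
                continue  # criterion (1): same strand
            # Assumption: the terminator lies downstream of the promoter.
            distance = t_pos - p_pos if p_strand == "+" else p_pos - t_pos
            if min_gap < distance < max_gap:  # criterion (2): 45 < d < 350
                candidates.append((p_pos, t_pos, p_strand))
    return candidates


promoters = [(1000, "+"), (5000, "-")]
terminators = [(1200, "+"), (1030, "+"), (4800, "-"), (1200, "-")]
print(pair_promoters_terminators(promoters, terminators))
```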

20
Synopsis of Method Used in Paper
  • Using the E. coli MG1655 genome, DNA regions that
    contained a sigma-70 promoter within a short
    distance of a rho-independent terminator were
    searched for
  • 227 putative sRNAs between 80 and 400 nt in
    length were predicted in E. coli by the paper, 32
    of which were already known to be sRNAs
  • Transcripts of some of the candidate loci were
    verified using Northern hybridization
  • The approach may also be usable for annotating
    sRNA loci in other bacterial genomes

21
Verification of Paper Results with Results Using
Our Method
  • Along with other results, the paper gives a
    detailed listing of the 227 sRNAs predicted,
    including the designation, strand orientation
    (forward or reverse), left and right boundaries
    (nt from genome start position), and length (nt)
    of each sRNA
  • Left and right boundary positions in genome given
    by paper were compared with left and right
    boundary positions of putative sRNAs found by our
    method
  • If an sRNA candidate from the paper was within
    100 nt of any sRNA predicted by our method, that
    sRNA was scored as found
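The matching rule in the last bullet can be written as an interval-distance test. The slide does not say exactly how "within 100 nt" was measured, so this sketch assumes it means the gap between the two intervals (zero if they overlap); `is_found` and the coordinates in the example are hypothetical.

```python
def is_found(candidate, predictions, threshold=100):
    """True if any predicted sRNA lies within `threshold` nt of the candidate.

    `candidate` and each prediction are (left, right) genome coordinates.
    """
    c_left, c_right = candidate
    for p_left, p_right in predictions:
        # Gap between the two intervals; 0 when they overlap or touch.
        gap = max(p_left - c_right, c_left - p_right, 0)
        if gap <= threshold:
            return True
    return False


predictions = [(1150, 1200), (9000, 9100)]
print(is_found((1000, 1100), predictions))          # gap of 50 nt
print(is_found((2000, 2100), predictions))          # nearest gap is 800 nt
print(is_found((2000, 2100), predictions, 1000))    # looser threshold
```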

22
Results of Verification
  • 227 candidate sRNAs were predicted in E. coli by
    the paper
  • Among them, 150 (66.1%) were localized by our
    method, according to the criterion above (within
    100 nt)
  • The test was re-run with a 50 nt threshold,
    yielding 140 hits (61.7%), a 10 nt threshold,
    yielding 128 hits (56.4%), and a 1000 nt
    threshold, yielding 187 hits (82.4%)
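The percentages above are the hit counts divided by the 227 candidates from the paper. A short check using only the numbers on this slide (the helper name is illustrative):

```python
TOTAL_CANDIDATES = 227
hits_by_threshold = {10: 128, 50: 140, 100: 150, 1000: 187}  # threshold (nt) -> hits


def percent_found(hits, total=TOTAL_CANDIDATES):
    """Percentage of paper candidates localized, rounded to one decimal."""
    return round(100 * hits / total, 1)


for threshold, hits in sorted(hits_by_threshold.items()):
    print(f"{threshold:>5} nt: {hits}/{TOTAL_CANDIDATES} = {percent_found(hits)}%")
```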

23
Preliminary Procedure for Extracting Small
Peptides
  • Extract Intergenic Regions From 30 Sequenced Genomes
  • Perform All Vs. All Nucleotide-Nucleotide BLAST
  • Select Significant Alignments, Concatenate and
    Format into QRNA Program Input
  • Run QRNA, Extract Alignments Scoring as Coding
    vs. sRNA and Null Hypothesis Regions
  • Score Regions Based on Quality of Fit Inside a
    Nearby Open Reading Frame
  • Extend Regions Within 25 nt Of Other Regions,
    Causing Them To Include Each Other
  • Merge sRNA Regions Which Align or Exactly
    Overlap Into Families
  • BLAST Resulting Family Entries Against SwissProt
    Database
  • Observe Results and Refine Extraction Method
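The slide does not specify how "quality of fit inside a nearby open reading frame" was scored. As a rough illustration only, the sketch below finds forward-strand ORFs (ATG to the first in-frame stop codon) and scores a region by the fraction of it covered by a single ORF. Both function names, the `min_codons` parameter, and the scoring rule itself are assumptions, not the authors' method.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) of forward-strand ORFs, end exclusive, stop included."""
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i                      # open a new ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close the ORF at the stop codon
    return orfs


def fit_in_orf(region, orfs):
    """Largest fraction of `region` (start, end) covered by any single ORF."""
    start, end = region
    best = 0.0
    for o_start, o_end in orfs:
        overlap = min(end, o_end) - max(start, o_start)
        best = max(best, max(0, overlap) / (end - start))
    return best


orfs = find_orfs("ccatgaaatgaa")
print(orfs, fit_in_orf((5, 11), orfs), fit_in_orf((0, 4), orfs))
```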
24
Preliminary Results of Small Peptide Search
  • Tblastx Alignment

Query    133   LPPNAGTYVPACWPSPALPYRQIPPEYPDSNP  38
Subject  1373  LPPLXTSXXPPPPPPPSXPLXSLPPSXPPSLP  1278
  • Query Sequence Information

Organism            Location (in genome)  Length (nt)  E-Value
Erwinia carotovora  843815-843948         133          0.69

Aligns To: gb|AAF36091.1 flagelliform silk protein (Nephila madagascariensis)
Sequence: aattccgtcgcatgttctctggtgagtacgacagcgcggattgctatctggatattcaggcgggatctggcggtacggaagcgcaggactgggccagcatgctggtacgtatgtacctgcgttgggcggaagc
25
Preliminary Results of Small Peptide Search
  • Tblastx Alignment

Query    62  PRATAPHPDPVRPAPETAPTP  124
Subject  90  PPAPAPRPPPVAPAPRPLPPP  28
  • Query Sequence Information

Organism              Location (in genome)  Length (nt)  E-Value
Pseudomonas syringae  6171796-6172006       210          0.23

Aligns To: emb|CAD88221.2 C. elegans GRL-25 protein (corresponding sequence ZK643.8)
Sequence: tgagttccggcagctcgtcatccagcttctgacgcaaccgcccggtcagaaacgcaaagccctcgagcaaccgctccacatccggatcccgtccggcctgccccagaaacggcgccaacgccggactacgctcggcgaagcgacgaccaagctggcgcagtgcagtgagttcgctctggtagtaatggttaaaggacacgggttacctgc
26
Conclusions
  • Possible sRNAs are found in 20-39% of the
    intergenic regions in each organism
  • Among them, 31% of the sRNAs satisfy the
    log-odds score threshold of 5.0 or higher
  • 137 families are conserved in 5 or more
    organisms
  • Being well conserved, sRNAs may be responsible
    for fundamental functions of living organisms

27
Future Direction
  • The search for sRNAs will be expanded to a larger
    number of more diverse genomes
  • Secondary structure prediction will later be
    employed in greater detail to verify sRNA regions
    that are well conserved among multiple
    evolutionarily distant organisms
  • Experimental verification of the findings of this
    study is under way (particularly for
    Shewanella oneidensis)
  • Comparative genomics will be used to discover the
    function associated with each sRNA and possibly
    to learn its role in a pathway