Identification of noncoding RNA with the ERPIN program

Transcript and Presenter's Notes

1
Identification of non-coding RNA with the ERPIN program
Daniel Gautheret, INSERM ERM 206, Techniques Avancées en Génomique et Clinique, Marseille
André Lambert, CNRS, Centre de Physique Théorique, Marseille
2
RNomics
  • ncRNA genes
      • tRNA
      • 5S/16S/30S rRNA
      • snoRNA (C/D, H/ACA box)
      • snRNA/uRNA
      • tmRNA
      • miRNA
      • SRP RNA
      • RNase P
      • RNase MRP
      • rprA, csrB, oxyS
  • Introns
      • Group I
      • Group II
  • Short Motifs
      • E-loop
      • GNRA loop
      • UNCG loop
      • AA platform
  • Functional Motifs
      • IRE
      • IRES
      • SECIS
      • TAR
      • Hammerhead
      • Hairpin ribozyme
      • Rho-dependent transcription terminator
      • T-box antiterminator

3
The specificity of RNA search
(Figure labels: BLAST, HMM)
  • ncRNA is defined both by primary and secondary
    structure
  • "Substitution matrices" for nucleic acids are far less
    informative than protein matrices such as PAM

Current ncRNA annotation is terrible!
4
The trouble with existing programs
  • Specific programs (e.g. tRNAscan)
  • Descriptor-based programs
      • Subjective and painful descriptor generation
      • Subtle constraints easily overlooked (e.g.
        certain pairings forbidden)
      • Yes/no answer, no scoring
  • Stochastic Context-Free Grammars
      • Not "practical" for large alignments or
        genome-wide searches (Eddy, 1999)
      • Time cost O(N^4) for a sequence of length N
      • Pseudoknots not allowed

h1 s1 h1 s2 h2 s3 h2
h1 5:5 1
h2 5:5 NNNNRYNNNN
s1 7:7 NUNNNNN
s2 4:40
s3 7:7 UUCNNNN
RnaMot descriptor for the anticodon-TΨC domain of tRNA
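The single-strand constraints in the descriptor above (NUNNNNN, UUCNNNN) are IUPAC patterns, and matching them yields exactly the yes/no answer the slide criticizes. A minimal Python sketch of that idea, assuming a plain regular-expression translation: this is not RnaMot's own code, the names `iupac_to_regex` and `matches` are hypothetical, and helix constraints are left out entirely.

```python
import re

# IUPAC nucleotide codes used in the descriptor (RNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]",    # purine
    "Y": "[CU]",    # pyrimidine
    "N": "[ACGU]",  # any nucleotide
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as 'UUCNNNN' into a compiled regex."""
    return re.compile("".join(IUPAC[ch] for ch in pattern))

def matches(pattern, seq):
    """Return True if seq satisfies the pattern over its full length."""
    return iupac_to_regex(pattern).fullmatch(seq) is not None

# A loop either satisfies the descriptor or it does not; there is no score.
t_loop_ok = matches("UUCNNNN", "UUCGAGU")
t_loop_bad = matches("UUCNNNN", "UUGGAGU")
```

A single substitution in a conserved position turns a match into a rejection, which is why descriptor searches give no ranking of near misses.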
5
RNA Structure Requires Probabilistic Descriptors
  • Most nucleotide positions are biased
  • Many biases escape human inspection
  • Probabilistic descriptors are needed to
    capture these biases

Can we use this information practically?
(Figure: R. Gutell's consensus bacterial 16S)
6
ERPIN: Profile-based RNA Search
Lambert & Gautheret, JMB 313, 1003-1011 (2001)
Training set
GTTCTTGCATGTTTGACGGAAC
GTTCTTGCATGATTGACGGAAC
GTTCTTGCATGTTTGACGGAAC
TTTCCTGCATGCTTGACGGAAC
TTTAT--CAAGTTCAT-ATAAA
ATTAT--CGTGCCTTC-ATAAT
ATTAT--CGTGTCTTC-ATAAT
ATTAT--CATGTTTC--ATAAT
(Figure labels: target sequence; helices h3, h5; strands l10, l14; helix score for h5-h3 computed from helix profile; single-strand profile (14 positions); sequence (14 nt); best score for l10 (4 gaps); best score for l14 (0 gaps))
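The single-strand profile idea can be sketched in a few lines of Python. This is an illustration only, not ERPIN's implementation: it assumes a log-odds profile against a uniform background with a pseudocount of 1, uses just the four ungapped training sequences from the slide, and scans without gaps (ERPIN itself handles gaps and scores helices with a separate base-pair profile). The function names are hypothetical.

```python
import math

# Ungapped subset of the training set shown on the slide (DNA alphabet).
TRAINING = [
    "GTTCTTGCATGTTTGACGGAAC",
    "GTTCTTGCATGATTGACGGAAC",
    "GTTCTTGCATGTTTGACGGAAC",
    "TTTCCTGCATGCTTGACGGAAC",
]
ALPHABET = "ACGT"

def build_profile(seqs, pseudocount=1.0, background=0.25):
    """Return one {nucleotide: log-odds score} dict per alignment column."""
    total = len(seqs) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(seqs[0])):
        column = [s[col] for s in seqs]
        profile.append({
            nt: math.log2((column.count(nt) + pseudocount) / total / background)
            for nt in ALPHABET
        })
    return profile

def score_window(profile, window):
    """Sum the column scores of one window of the target sequence."""
    return sum(col[nt] for col, nt in zip(profile, window))

def best_hit(profile, target):
    """Slide the profile along the target; return (best score, start index)."""
    width = len(profile)
    return max(
        (score_window(profile, target[i:i + width]), i)
        for i in range(len(target) - width + 1)
    )

profile = build_profile(TRAINING)
score, position = best_hit(profile, "AAAA" + TRAINING[0] + "CCCC")
```

Unlike a descriptor, the profile returns a number for every window, so candidates can be ranked and a cutoff chosen afterwards.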
7
Validation
  • Original paper: validation on known RNA genes and
    motifs
      • tRNA
      • IRE (Iron Response Element)
      • SECIS (Selenocysteine Insertion Sequence)

8
Handling of complex structures, search strategies
(Erpin 2 & 3)
tRNA training set
h1111111aa2222bbbbbbbbbbb2222c33333ddddddd33333eeeeeeeeeeeeeeeeeeeeeeee44444fffffff444441111111gggg
>DA0260 TGC PHAGE T5
-GGGCGAATAGTGTCAGC-GGG--AGCACACCAGACTTGCAATCTGGTA-------------------G-GGAGGGTTCGAGTCCCTCTTTGTCCACCA
>DA0340 TGC ARCHAEGLOBUS FULG. ARCHAE
-GGGCTCGTAGCTCAGC--GGG--AGAGCGCCGCCTTTGCGAGGCGGAG-------------------GCCGCGGGTTCAAATCCCGCCGAGTCCA---
>DA0380 TGC HALOBACTERIUM CUT. ARCHAE
-GGGCCCATAGCTCAGT--GGT--AGAGTGCCTCCTTTGCAAGGAGGAT-------------------GCCCTGGGTTCGAATCCCAGTGGGTCCA---
  • First search for the TΨC loop
  • If found, look for all other elements
  • -> Faster search with little sensitivity loss
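The two-step strategy (look for the TΨC loop first, then examine the rest only where it was found) can be illustrated with a deliberately simplified Python sketch. The assumptions are loud ones: stage 1 is an exact match on the "TTC" start of a 7-nt loop and stage 2 merely counts base pairs in the 5-bp stem around it, whereas ERPIN scores both steps with profiles. All function names are hypothetical.

```python
# Watson-Crick pairs plus G-T wobble (DNA alphabet, as in the training set).
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}

def loop_candidates(seq, prefix="TTC", loop_len=7):
    """Stage 1: cheap scan for positions where a loop could start."""
    return [i for i in range(len(seq) - loop_len + 1) if seq.startswith(prefix, i)]

def stem_pairs(seq, loop_start, loop_len=7, stem_len=5):
    """Stage 2: count base pairs in the stem enclosing the candidate loop."""
    left_start = loop_start - stem_len
    right_end = loop_start + loop_len + stem_len
    if left_start < 0 or right_end > len(seq):
        return 0
    left = seq[left_start:loop_start]
    right = seq[loop_start + loop_len:right_end]
    return sum((a, b) in PAIRS for a, b in zip(left, reversed(right)))

def two_step_search(seq, min_pairs=5):
    """Run stage 2 only at the positions that passed stage 1."""
    return [i for i in loop_candidates(seq) if stem_pairs(seq, i) >= min_pairs]

# Phage T5 tRNA from the training set above, with alignment gaps removed.
aligned = (
    "-GGGCGAATAGTGTCAGC-GGG--AGCACACCAGACTTGCAATCTGGTA-"
    "------------------G-GGAGGGTTCGAGTCCCTCTTTGTCCACCA"
)
trna = aligned.replace("-", "")
hits = two_step_search(trna)
```

The expensive check runs only at the few positions that survive the cheap one, which is where the speed-up comes from; the sensitivity cost is whatever the first step wrongly rejects.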

9
SECIS search
  • Selenocysteine Insertion Sequence
      • Required for readthrough of UGA codon
      • Signature for selenoprotein genes
  • Previous searches
      • RNAMOT: 30 hits / 100 Mb
      • PATSCAN + energy: 150 hits / 100 Mb
  • Additional filters necessary
      • Analysis of reading frame
      • Experiments
  • Can we do better?

(Figure labels: CDS, stop, UTR, SECIS)
10
Erpin score and false positive rates
  • Y = 6.6 × 10^5 e^(-0.3 S)
  • E = K m n e^(-λ S)
    (extreme value distribution)
  • Score 30 (min): 3.8 hits / 100 Mb
  • Score 40 (ave): 0.13 hits / 100 Mb

SECIS element hits in 700 Mb of randomized sequences
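The two quoted false-positive rates are enough to recover the exponential decay constant, as the following Python sketch shows. It assumes the rate falls off as a pure exponential in the score (the tail of the extreme value distribution); the function names are hypothetical and nothing here comes from ERPIN itself.

```python
import math

def fit_exponential(s1, rate1, s2, rate2):
    """Fit rate(S) = prefactor * exp(-lam * S) through two (score, rate) points."""
    lam = math.log(rate1 / rate2) / (s2 - s1)
    prefactor = rate1 * math.exp(lam * s1)
    return prefactor, lam

def expected_hits(score, prefactor, lam, megabases=100.0):
    """Expected false positives at a score cutoff; prefactor is per 100 Mb."""
    return prefactor * math.exp(-lam * score) * megabases / 100.0

# The two points quoted on the slide, both in hits per 100 Mb.
prefactor, lam = fit_exponential(30, 3.8, 40, 0.13)
prefactor_700mb = prefactor * 7  # rescaled to the 700 Mb of randomized sequence
```

Under this assumption the fitted decay constant is about 0.34 and the prefactor rescaled to 700 Mb is roughly 6.6 × 10^5, which suggests that the slide's formula counts hits in the full 700 Mb and rounds the exponent to 0.3.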
11
SECIS Iterative Search
  • Initial training set: 43 SECIS sequences
  • Total scanned: 4 Gb
  • After 5 rounds: 120 sure SECIS, 200 novel
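The slide reports only the counts, but the general shape of an iterative search (scan, add confident hits to the training set, retrain, repeat) can be sketched in Python. Everything below is a toy under stated assumptions: the scorer is the best per-position identity to any training sequence rather than an ERPIN profile, the sequences are made up, and the function names are hypothetical.

```python
def identity(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))

def toy_scan(training, database):
    """Toy stand-in for a profile search: best identity to any training member."""
    return [(max(identity(seq, t) for t in training), seq) for seq in database]

def iterative_search(training, database, threshold, max_rounds=5):
    """Grow the training set with confident hits until nothing new is found."""
    training = list(training)
    for _ in range(max_rounds):
        hits = [seq for score, seq in toy_scan(training, database)
                if score >= threshold and seq not in training]
        if not hits:
            break
        training.extend(hits)
    return training

# Made-up 6-nt sequences: each round reaches one step further from the seed.
database = ["ACGTAA", "ACGAAA", "TTTTTT", "AGGAAA"]
final_set = iterative_search(["ACGTAC"], database, threshold=5)
```

Each round lets the model recognize sequences that were too divergent for the previous one, which is how a small seed set can grow; the threshold has to stay strict, since a false positive admitted in one round contaminates every later round.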

12
The Erpin Download Site
13
The ncRNA annotation Project
  • Create training sets for all major classes of
    ncRNA
  • Collaborative projects
  • Establish global ncRNA annotation resource

14
All searches are parameterized to scan a bacterial
genome in less than 1 minute
http://tagc.univ-mrs.fr/pub/erpin/
15
(No Transcript)
16
How is a protein-coding gene detected?
  • ORF
  • Pol II promoters and CpG islands
  • Codon/hexanucleotide statistics (species-specific
    biases) -> HMM
  • Availability of
      • ESTs
      • Large protein/gene sequence databases