Title: Identification of noncoding RNA with the ERPIN program
1 Identification of non-coding RNA with the ERPIN program
Daniel Gautheret, INSERM ERM 206, Techniques avancées en Génomique et Clinique, Marseille
André Lambert, CNRS, Centre de Physique Théorique, Marseille
2 RNomics
- ncRNA genes
- tRNA
- 5S, 16S, 30S rRNA
- snoRNA (C/D H/ACA box)
- snRNA/uRNA
- tmRNA
- miRNA
- SRP RNA
- RNase P
- RNase MRP
- rprA, csrB, oxyS
- Short Motifs
- E-loop
- GNRA loop
- UNCG loop
- AA platform
- Functional Motifs
- IRE
- IRES
- SECIS
- TAR
- Hammerhead
- Hairpin ribozyme
- Rho-dependent transcription terminator
- T-box antiterminator
3 The specificity of RNA search
(Figure labels: BLAST, HMM)
- ncRNA is defined by both primary and secondary structure
- Substitution matrices for nucleic acids are terrible compared to PAM etc.
Current ncRNA annotation is terrible!
4 The trouble with existing programs
- Specific Programs (tRNA-scan)
- Descriptor-based programs
- Subjective and painful descriptor generation
- Subtle constraints easily overlooked (e.g. certain pairings forbidden)
- Yes/no answer, no scoring
- Stochastic Context Free Grammars
- Not practical for large alignments or genome-wide searches (Eddy, 1999)
- Time cost O(N^4) for a sequence of length N
- Pseudoknots not allowed
h1 s1 h1 s2 h2 s3 h2
h1 5:5 1
h2 5:5 NNNNR:YNNNN
s1 7:7 NUNNNNN
s2 4:40
s3 7:7 UUCNNNN
RnaMot descriptor for the anticodon-TΨC domain of tRNA
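The descriptor approach criticized on this slide can be illustrated with a minimal sketch. It is not RnaMot itself: the function name `find_hairpins`, the fixed 5 bp stem and the 7 nt `NUNNNNN` loop are assumptions chosen to resemble one arm of the descriptor. It shows the yes/no behaviour noted in the bullets: a window either satisfies every constraint or is discarded, with no score.

```python
import re

# Allowed base pairs (Watson-Crick plus G-U wobble); an assumption of this sketch.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def find_hairpins(seq, stem=5, loop_pattern="NUNNNNN"):
    """Return start positions of windows matching stem / loop / stem exactly."""
    loop_re = re.compile(loop_pattern.replace("N", "[ACGU]"))
    loop_len = len(loop_pattern)
    size = 2 * stem + loop_len
    hits = []
    for i in range(len(seq) - size + 1):
        five = seq[i:i + stem]
        loop = seq[i + stem:i + stem + loop_len]
        three = seq[i + stem + loop_len:i + size]
        # Position k of the 5' strand pairs with position -k-1 of the 3' strand.
        paired = all((a, b) in PAIRS for a, b in zip(five, reversed(three)))
        if paired and loop_re.fullmatch(loop):
            hits.append(i)  # yes/no: no score is attached to the hit
    return hits

# Anticodon-like arm: stem CCAGA, loop CUGAAGA, stem UCUGG, flanked by AA.
print(find_hairpins("AACCAGACUGAAGAUCUGGAA"))
```

A single substitution at the conserved loop position removes the hit entirely, which is the weakness a scored, probabilistic descriptor is meant to address.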
5 RNA Structure Requires Probabilistic Descriptors
- Most nucleotide positions are biased
- Many biases escape human inspection
- Probabilistic descriptors are needed to incorporate this
Can we use this information practically?
(Figure: R. Gutell's consensus bacterial 16S)
6 ERPIN: Profile-based RNA Search
Lambert & Gautheret, JMB, 313, 1003-1011 (2001)
(Figure: ERPIN search principle; labels l10, l14, h3, h5)
Training set
GTTCTTGCATGTTTGACGGAAC
GTTCTTGCATGATTGACGGAAC
GTTCTTGCATGTTTGACGGAAC
TTTCCTGCATGCTTGACGGAAC
TTTAT--CAAGTTCAT-ATAAA
ATTAT--CGTGCCTTC-ATAAT
ATTAT--CGTGTCTTC-ATAAT
ATTAT--CATGTTTC--ATAAT
Single-strand profile (14 positions)
Target sequence
Helix score for h5-h3 computed from helix profile
Sequence (14 nt)
best score for l10 (4 gaps)
best score for l14 (0 gaps)
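A minimal sketch of the profile idea follows, using the eight training sequences from this slide rejoined into 22 columns. It is not the ERPIN implementation. The column assignments in `HELIX` and `STRAND`, the uniform background, the pseudocount and all function names are assumptions made for illustration. The candidate is scored in already-aligned form, so the gapped single-strand alignment step that the slide alludes to ("best score for l10 (4 gaps)") is left out.

```python
import math
from collections import Counter

TRAINING = [
    "GTTCTTGCATGTTTGACGGAAC", "GTTCTTGCATGATTGACGGAAC",
    "GTTCTTGCATGTTTGACGGAAC", "TTTCCTGCATGCTTGACGGAAC",
    "TTTAT--CAAGTTCAT-ATAAA", "ATTAT--CGTGCCTTC-ATAAT",
    "ATTAT--CGTGTCTTC-ATAAT", "ATTAT--CATGTTTC--ATAAT",
]
HELIX = [(0, 21), (1, 20), (2, 19), (3, 18)]  # assumed paired columns
STRAND = list(range(4, 18))                   # assumed single-strand columns (14)

def lod_scorer(observations, n_states, pseudo=1.0):
    """Return a function giving the log-odds of a state vs. a uniform background."""
    counts = Counter(observations)
    total = len(observations) + pseudo * n_states
    def score(state):
        freq = (counts[state] + pseudo) / total
        return math.log2(freq * n_states)  # background frequency is 1 / n_states
    return score

def build_profile(alignment):
    # Helix profile: one scorer per column pair over the 16 possible base pairs.
    helix = [lod_scorer([s[i] + s[j] for s in alignment], 16) for i, j in HELIX]
    # Single-strand profile: one scorer per column over A, C, G, T and gap.
    strand = [lod_scorer([s[k] for s in alignment], 5) for k in STRAND]
    return helix, strand

def score_aligned(candidate, profile):
    """Sum helix and single-strand log-odds for a 22-column aligned candidate."""
    helix, strand = profile
    h = sum(f(candidate[i] + candidate[j]) for f, (i, j) in zip(helix, HELIX))
    s = sum(f(candidate[k]) for f, k in zip(strand, STRAND))
    return h + s

profile = build_profile(TRAINING)
print(round(score_aligned(TRAINING[0], profile), 2))
print(round(score_aligned("ACGTACGTACGTACGTACGTAC", profile), 2))
```

Scoring base pairs as joint states, rather than as two independent columns, is what lets a helix profile capture covariation that a plain sequence profile would miss.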
7 Validation
- Original paper: validation on known RNA genes and motifs
- tRNA
- IRE (Iron Response Element)
- SECIS (Selenocysteine Insertion Sequence)
8 Handling of complex structures, search strategies (Erpin 2 3)
tRNA Training set
h1111111aa2222bbbbbbbbbbb2222c33333ddddddd33333eeeeeeeeeeeeeeeeeeeeeeee44444fffffff444441111111gggg
>DA0260 TGC PHAGE T5
-GGGCGAATAGTGTCAGC-GGG--AGCACACCAGACTTGCAATCTGGTA-------------------G-GGAGGGTTCGAGTCCCTCTTTGTCCACCA
>DA0340 TGC ARCHAEGLOBUS FULG. ARCHAE
-GGGCTCGTAGCTCAGC--GGG--AGAGCGCCGCCTTTGCGAGGCGGAG-------------------GCCGCGGGTTCAAATCCCGCCGAGTCCA---
>DA0380 TGC HALOBACTERIUM CUT. ARCHAE
-GGGCCCATAGCTCAGT--GGT--AGAGTGCCTCCTTTGCAAGGAGGAT-------------------GCCCTGGGTTCGAATCCCAGTGGGTCCA---
- First search for the TΨC loop
- If found, look for all other elements
-> Faster search with little sensitivity loss
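The two-step strategy above can be sketched as a generic staged scan: a cheap scorer for one element is applied at every position, and the full motif is scored only where that element is found. The function `staged_search`, the toy scorers, the anchor string `TTCGA` and the cutoffs are hypothetical stand-ins, not ERPIN's actual profiles or options.

```python
WC = {"AT", "TA", "GC", "CG"}  # Watson-Crick pairs used by the toy full scorer

def anchor_score(window, anchor="TTCGA"):
    """Toy step-1 scorer: number of positions matching a fixed anchor string."""
    return sum(a == b for a, b in zip(window, anchor))

def full_score(region, stem=3):
    """Toy step-2 scorer: paired positions between the two ends of the region."""
    return sum(a + b in WC for a, b in zip(region[:stem], reversed(region[-stem:])))

def staged_search(seq, anchor_len=5, anchor_cutoff=5, left=3, right=3, full_cutoff=3):
    """Score the full region only at positions where the anchor passes its cutoff."""
    hits, evaluated = [], 0
    for i in range(len(seq) - anchor_len + 1):
        if anchor_score(seq[i:i + anchor_len]) < anchor_cutoff:
            continue  # step 1 failed: skip the expensive step
        start, end = i - left, i + anchor_len + right
        if start < 0 or end > len(seq):
            continue  # not enough flanking sequence for the full motif
        evaluated += 1
        total = full_score(seq[start:end])
        if total >= full_cutoff:
            hits.append((start, total))
    return hits, evaluated

seq = "AAAA" + "GCC" + "TTCGA" + "GGC" + "AAAA" + "TTCGA" + "AAAA"
print(staged_search(seq))
```

In this toy run the full scorer is called at only two of the 24 window positions, which is where the speed-up comes from. The cost is that a true motif with a weak first element is never examined, hence the "little sensitivity loss".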
9 SECIS search
- Selenocysteine Insertion Sequence
- required for readthrough of UGA codon
- signature for selenoprotein genes
- Previous searches
- RNAMOT: 30 hits / 100 Mb
- PATSCAN energy: 150 hits / 100 Mb
- Additional filters necessary
- Analysis of reading frame
- Experiments
- Can we do better?
(Figure labels: SECIS, stop, UTR, CDS)
10 Erpin score and false positive rates
- Y = 6.6 × 10^5 e^(-0.3 S)
- E = K m n e^(-λ S)
- (extreme value distribution)
- Score 30 (min)
- 3.8 hits / 100 Mb
- Score 40 (ave)
- 0.13 hits / 100 Mb
(Figure: SECIS element hits in 700 Mb of randomized sequences)
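As a worked check of the exponential score dependence, the sketch below derives the decay constant implied by the two quoted false positive rates (3.8 and 0.13 hits per 100 Mb at scores 30 and 40). It also evaluates the fitted curve, read here as Y = 6.6 × 10^5 · e^(-0.3 S). That reading of the garbled slide text, the function names and the default parameters are assumptions.

```python
import math

def decay_from_rates(rate_low, rate_high, score_low, score_high):
    """Decay constant of rate ~ exp(-decay * score) from two (score, rate) points."""
    return math.log(rate_low / rate_high) / (score_high - score_low)

def fitted_hits(score, prefactor=6.6e5, decay=0.3):
    """Fitted curve as read from the slide (assumed form and constants)."""
    return prefactor * math.exp(-decay * score)

decay = decay_from_rates(3.8, 0.13, 30, 40)
print(f"implied decay constant: {decay:.3f} per score unit")
print(f"fitted-curve ratio between scores 30 and 40: {fitted_hits(30) / fitted_hits(40):.1f}")
```

The quoted rates imply a decay constant of about 0.34 per score unit, in line with the 0.3 in the fitted expression if that value was rounded on the slide.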
11 SECIS Iterative Search
- Initial training set: 43 SECIS sequences
- Total scanned: 4 Gb
- After 5 rounds: 120 sure SECIS, 200 novel
12 The Erpin Download Site
13 The ncRNA annotation Project
- Create training sets for all major classes of ncRNA
- Collaborative projects
- Establish global ncRNA annotation resource
14 All searches parameterized to scan a bacterial genome in less than 1 minute
http://tagc.univ-mrs.fr/pub/erpin/
16 How is a protein-coding gene detected?
- ORF
- Pol II Promoters and CpG islands
- Codon/hexanucleotide statistics (species-specific biases) -> HMM
- Availability of
- ESTs
- Large protein gene sequence databases
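The first signal in the list above, the ORF, is simple to detect, which is part of why protein-coding gene finding is easier than ncRNA search. A minimal forward-strand ORF scan might look like the sketch below. The function name `find_orfs` and the `min_codons` threshold are assumptions, and real gene finders combine this with the statistical signals listed on the slide.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) of ATG...stop ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop, including the ATG.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # the stop codon closes the ORF
    return orfs

print(find_orfs("CCATGAAACCCGGGTAACC"))
```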