Title: Diversity of UTR Regions
1Diversity of UTR Regions
Alternate Transcript Diversity
2Part I: EST-based analysis of polyadenylation and UTRs
3The PolyA Site (PAS)
[Figure: the 3' exon, showing the stop codon, the UTR, the polyA signal (AATAAA), a 17 nt spacing label, the PolyA site (PAS), and the polyA tail]
4Alternative PolyA Sites
From Edwalds-Gilbert et al., NAR 25, 2547 (1997)
5Alternative PAS: Post-transcriptional (de)regulation
[Figure: coding sequence followed by a 3' UTR carrying several AUUAAA polyA signals and a possible regulatory element (stability, translation, transport)]
Use of an abnormal polyA site is associated with various diseases:
- α/β-thalassemia (globin)
- Mantle cell lymphoma (cyclin CCND1)
- Teratocarcinoma (PDGF)
- Hypertension (Ca2+ ATPase)
6PAS Discovery through EST/mRNA Alignment
[Figure: 5' ESTs, ESTs, and 3' ESTs aligned to an mRNA or EST contig]
First observation in 1998: 189 cases of alternative polyadenylation
Gautheret et al. (1998) Genome Res. 8, 524
7Year 2000: 1000 genes observed with alternative PAS
(estimate: at least 22% of genes)
Beaudoing et al. (2000) Genome Res. 10, 1001
8EST Counts as a Measure of Signal Efficiency
[Figure: four bar-chart panels (1 site, 2 sites, 3 sites, 4 sites); y-axis: ESTs per site (0 to 6); legend: AAUAAA, AUUAAA, other 1-base variants, others]
- Distal site more efficient than proximal
- AAUAAA signal more efficient than variant signal
9Tissue-specific sites
Tissue   Site 450   Site 2700
Bone     2          10
Others   49         11
Fisher's exact P = 0.00003
1942 biases in 951 different human 3' UTRs
Beaudoing & Gautheret (2001) Genome Res. 9, 1520
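The P value quoted above can be checked with a minimal sketch of a two-sided Fisher's exact test. The 2x2 layout used here (rows Bone/Others, columns site 450/site 2700) is an assumption read off the slide labels, and `fisher_exact_two_sided` is an illustrative helper, not code from the cited paper.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are at least as unlikely as the observed one
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Assumed layout: Bone = (2, 10), Others = (49, 11)
p_value = fisher_exact_two_sided(2, 10, 49, 11)
print(f"two-sided P = {p_value:.2g}")
```

Exact integer binomials keep the sketch dependency-free; for routine use, `scipy.stats.fisher_exact` provides the same test.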
10Improved Definition of PAS signals
- How does the polyA machinery tell a true cleavage site from a random AATAAA?
- What other signals help dictate the use of specific sites in certain conditions?
Upstream Sequence Element (USE): enhancer element; mostly found in viral sequences
Downstream Sequence Element (DSE): constitutive; poorly defined; mutations tolerated
11Analysis of PAS-flanking regions
[Figure: genomic regions flanking the PAS]
12Most Frequent Hexamers
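As a rough illustration of how the most frequent hexamers in PAS-flanking regions might be tallied, the sketch below counts every overlapping 6-mer in a set of sequences. The three input sequences are invented toy data, and `hexamer_counts` is a hypothetical helper rather than the authors' pipeline.

```python
from collections import Counter

def hexamer_counts(sequences, k=6):
    """Count all overlapping k-mers (default: hexamers) across sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

# Toy upstream regions (invented for illustration only)
toy_regions = ["GGCAATAAAGTC", "TTAATAAACCAT", "CGATTAAAGGCA"]
counts = hexamer_counts(toy_regions)
for hexamer, n in counts.most_common(3):
    print(hexamer, n)
```

On real data the counts would normally be compared with a background set (for example, shuffled or non-PAS regions) before calling a hexamer enriched.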
13The DSE is a U-rich Region
[Figure: nucleotide frequencies by position (0 = polyA signal)]
Legendre & Gautheret, BMC Genomics (2003)
14USE, DSE and Polyadenylation Efficiency
[Figure: U frequency by position (0 = polyA signal), strong sites vs. weak sites]
USE, paired t-test (USE weak vs. USE strong): t = -0.1826, df = 35, p-value = 0.8562
DSE, paired t-test (DSE weak vs. DSE strong): t = -3.5876, df = 35, p-value = 0.001010
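For readers who want to see what the paired t-test computes, here is a minimal sketch of the statistic. With df = 35, the slide's tests compare 36 paired values; what those pairs are is not stated, so the numbers below are invented placeholders and `paired_t` is an illustrative helper.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Invented U-frequency values for weak and strong sites (placeholders)
weak = [1.0, 2.0, 3.0, 4.0]
strong = [2.0, 4.0, 5.0, 8.0]
t_stat, df = paired_t(weak, strong)
print(f"t = {t_stat:.4f}, df = {df}")
```

The sketch stops at the statistic because the standard library has no Student t distribution; `scipy.stats.ttest_rel` returns both t and the p-value.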
15EST-based PAS Map
2005
- 3' ESTs mapped directly to the genome
- 67,000 polyA sites identified
- 28,000 sites within 2 kb of an Ensembl gene
- 29,000 sites not within 10 kb of an Ensembl gene
- 57% of genes have 2 or more polyA sites
- (was 22% in 2000!)
16The ATD Project
- Integrate Splice + PolyA + Init variants
- Quality control
- Tissue-specific Isoforms
- Regulatory motifs
- Isoform specific oligos
- RT-PCR validation of selected isoforms
The ATD project is funded by the European
Commission within its FP6 Programme, under the
thematic area "Life sciences, genomics and
biotechnology for health", contract number
LHSG-CT-2003-503329
17Revisiting UTR Length
18What is the actual reach of the 3' UTR?
- Textbook: Human Molecular Genetics 2 (1999)
- 3' UTR: average of about 0.6 kb (see Zhang, 1998), but this is likely to be an underestimate because of underreporting of genes with long 3' UTRs
- Untranslated Regions of mRNA (Mignone et al. 2003)
19Many recent papers mentioning distal PAS
- All rely on EST sampling, but
- Require alignment on a RefSeq gene / full-length cDNA or overlapping ESTs
- Cannot assess all long-range PAS
20How can you make sure a PAS pertains to the nearest 5' gene?
- In the absence of overlapping ESTs: danger!
- There could be another short gene in the interval
- PAS could be just noise (remember: 29,000 PAS are >10 kb from any Ensembl gene)
=> We need a gauge to evaluate PAS reality
21Gauge signal usage
[Figure: mouse and human panels; x-axis: distance from STOP; y-axis: ratio; series: AAUAAA and all 11 signals; 15 kb marked]
Background is not only false positives (FP)!
- Noisy PAS are expected to use random polyA signals
- Not dependent on EST coverage
- True PAS appear dominant up to 15 kb!
22Direct UTR counts
[Figure: number of sites vs. distance from STOP; 15 kb marked]
23Integrate
- Twice the length of the 3' UTR in Ensembl, RefSeq, and full-length cDNAs
- At least 4000 human genes have, in their longest form, a 3' UTR longer than 3 kb.
24Intergenic polyA
- About 50% of predicted polyA sites fall in intergenic regions (>15 kb from Stop)
- Consistent with recent tiling microarray data (Rosetta etc.)
- We estimate that 75% of our intergenic polyA sites are true
25UTR length
- Two independent measures converge towards significant numbers of UTRs at least up to 15 kb
- Ensembl/RefSeq average 3' UTR (longest non-zero UTR): 1 kb
- Actual value: 2.4 kb. Each Ensembl/RefSeq gene thus lacks on average 1.5 kb of its UTR!
- Chicken is shorter, but poorly sampled (not shown)
- Mostly due to alternative polyadenylation
- Search space for regulatory motifs, miRNA targets, etc. is doubled (additional 22 Mb)
26Selected PAS or transcriptional leakage?
- 3' UTR sizes from orthologues (Ensembl).
- Chi2 probability = 0!
- Long UTR in human => long UTR in mouse
27Conservation of multiple polyadenylation
- Number of PAS in orthologous genes
- Chi2 P-value < 10^-30
[Figure: number of PAS per gene, mouse vs. human]
28Conservation function
- Alternative polyadenylation appears to be regulated
- Increased importance of UTR extension as a target for post-transcriptional regulation
29Alternative PAS Conservation across Species
Identifying Regulated Alternative PAS
(ongoing work)
30What is a Conserved PAS?
[Figure: PAS sites in orthologs (human, mouse, rat), labeled Specific, Partially Conserved, or Conserved]
Detect and Classify
31Topology Alone is Ambiguous
[Figure: human and mouse PAS whose correspondence is unclear from position alone (marked "?")]
Use sequence conservation
32Best species for studying conserved alt PAS
(another reason to like chicken)
- 5 species used
- human
- chimpanzee
- mouse
- rat
- chicken
From Margulies et al. Genome Res. 2003
33Criteria for Conserved PAS Detection
Scan (window = 6 bp, shift = 1 bp)
[Figure: conserved block containing the polyA signal, with a 25 bp flanking region on each side (at least one must be conserved)]
The polyA signal should be conserved in N species, and the flanking region should have >65% identity over N species.
N = 2, 3, 4, 5
More stringent than usual criteria for identifying selective pressure
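The criteria above can be sketched as a scan over a multiple alignment. This is a minimal illustration under stated assumptions: "conserved signal" is taken to mean an identical hexamer in every aligned species, identity is the fraction of flank columns where all species agree, and only the two signals named elsewhere in these slides (AATAAA, ATTAAA) are searched. `conserved_pas` and `column_identity` are hypothetical names.

```python
SIGNALS = {"AATAAA", "ATTAAA"}  # assumption: restrict to these two signals

def column_identity(rows, start, end):
    """Fraction of alignment columns in [start, end) where all rows agree."""
    start, end = max(start, 0), min(end, len(rows[0]))
    if end <= start:
        return 0.0
    same = sum(1 for i in range(start, end)
               if rows[0][i] != "-" and all(r[i] == rows[0][i] for r in rows))
    return same / (end - start)

def conserved_pas(rows, flank=25, min_identity=0.65):
    """Start columns of 6 bp windows that pass the conservation criteria.

    rows: aligned sequences of equal length, reference species first.
    """
    hits = []
    for i in range(len(rows[0]) - 6 + 1):            # window = 6 bp, shift = 1 bp
        window = rows[0][i:i + 6]
        if window not in SIGNALS:
            continue
        if any(r[i:i + 6] != window for r in rows):  # signal identical in all rows
            continue
        up = column_identity(rows, i - flank, i)
        down = column_identity(rows, i + 6, i + 6 + flank)
        if max(up, down) > min_identity:             # at least one flank conserved
            hits.append(i)
    return hits

# Toy two-species alignment (invented): conserved upstream flank only
upstream = "ACGTACGTACGTACGTACGTACGTA"
human = upstream + "AATAAA" + "GT" * 12 + "G"
mouse = upstream + "AATAAA" + "C" * 25
print(conserved_pas([human, mouse]))
```

Flanks that run past the ends of the alignment are simply truncated here; a stricter version could reject such windows instead.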
34Conserved PAS candidates Supported by EST Mapping
[Figure: bar chart; y-axis: conserved PAS covered by ESTs (0 to 40); x-axis: N-species conservation (no conservation, N = 2, N = 3, N = 4, N = 5)]
30% of conserved PAS are supported by ESTs
- We should require at least 3-species conservation to consider a PAS as conserved
35A Significant Fraction of Genes has Conserved PAS
- Over 22000 annotated human genes
- >20% have a putative CONPAS (at least 3-species)
- 10% have multiple putative CONPAS
- 7% have a putative CONPAS supported by ESTs
- Compares to 10-15% conserved alt-splice variants
- Suggests selective pressure for many polyA site
sequences in animal genomes
36Why should PAS be embedded in conserved sequences?
- Regulatory protein binding site?
- Regulatory RNA structure?
- Antisense or miRNA targets?
Probably not
In some cases
Probably most cases
37Part II: Erpin News
38ERPIN Profile-based RNA Motif Search
Training set -> helix profile (16 x N; rows AA, GA, CA, UA, AG, GG, CG, UG, ..., UU) and single-strand profile (5 x N)
$S_{b_1,b_2} = \log\left(\frac{F_{b_1 b_2}}{F_{b_1} \times F_{b_2}}\right)$
Search algorithm combines dynamic programming for single strands and profile search for helices
Gautheret & Lambert, JMB (2001) 313, 1005.
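The helix score above is a log-odds ratio, which a few lines of code can make concrete. In this sketch the numerator is taken to be the observed frequency of a base pair in one helix column of the training set and the denominator the product of background base frequencies; that reading, the uniform background, and the helper name `helix_column_scores` are assumptions, not details confirmed by the slide.

```python
from collections import Counter
from math import log

def helix_column_scores(pairs, background):
    """Log-odds score of each base pair observed in one helix column."""
    counts = Counter(pairs)
    total = len(pairs)
    scores = {}
    for pair, n in counts.items():
        observed = n / total                                  # F(b1 b2)
        expected = background[pair[0]] * background[pair[1]]  # F(b1) x F(b2)
        scores[pair] = log(observed / expected)
    return scores

# Toy column: four training sequences, uniform background (assumed)
background = {base: 0.25 for base in "ACGU"}
column = ["CG", "CG", "CG", "UG"]
scores = helix_column_scores(column, background)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

Pairs that never occur in the column get no finite score in this form, since their observed frequency is zero.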
39Recent development: pseudocounts
A Mir-133 training set
(( - ((((((( ------ ((( - (((( ---------------------- )))) ))) ------ ))))))) - ))
TC t GGCTGGT caaac- GGA a CCAA gtccgtcttcctgagaggt--- TTGG TCC CCTTCA ACCAGCT a CA
TG t GGCTGGT caaac- GGA a CCAA gtcaggtgtttctgtgaggt-- TTGG TCC CCTTCA ACCAGAC t AT
TG t GGCTGGT aaaac- GGA a CCAA gtcaggtgtttttgtgaggt-- TTGG TCC CCTTCA ACCAGCT a TG
TG c GGCTGGT gaaaa- GGA a CCAC atcaacccagaaaaaggat--- TTGG TCC CCTTCA ACCAGCC g CA
TA t GGCTGGT caaac- GGA a CCAA gtccgtcttccttagaggt--- TTGG TCC CCTTCA ACCAGCT a TT
AG t TGCTGGT aaaac- GGA a CCAA gtcgggtgtttgcgagaggt-- TTGG TCC CTTTCA ACCAGCT a CT
TG t GGCTGGT caaat- GGA a CCAA gtcaggtgtttctgcgaggt-- TTGG TCC CCTTCA ACCAGCT a CT
A column with 100% CG: every other pair scores log(obs/expected) = an arbitrary low value!
What about GC or AU in this column? Is it as bad as CC or AG?
Answer: fill columns with expected counts, based on a reasonable model (pseudocounts). This requires RNA base-pair (bp) and single-strand (ss) substitution matrices.
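One way to picture the pseudocount idea is sketched below: observed pair counts in a column are topped up with expected counts derived from a substitution matrix, so that a plausible replacement (GC for CG) ends up scoring better than an implausible one (CC). This is a generic illustration, not ERPIN's actual formula; the four-pair matrix `toy_sub` and its values are invented, and `add_pseudocounts` is a hypothetical helper.

```python
def add_pseudocounts(counts, sub, weight=1.0):
    """Blend observed pair counts with substitution-based expected counts.

    counts: {pair: observed count} for one helix column
    sub:    {source pair: {target pair: substitution frequency}}
    weight: total pseudocount mass added to the column
    """
    total = sum(counts.values())
    filled = {}
    for target in sub:
        # Expected share of `target`, given which pairs were observed and how
        # often each of them is replaced by `target` in the matrix
        expected = sum(
            (n / total) * sub[src][target] / sum(sub[src].values())
            for src, n in counts.items()
        )
        filled[target] = counts.get(target, 0) + weight * expected
    return filled

# Invented toy matrix over four pairs only (real matrices are 16 x 16)
toy_sub = {
    "CG": {"CG": 0.90, "GC": 0.06, "AU": 0.03, "CC": 0.01},
    "GC": {"CG": 0.06, "GC": 0.90, "AU": 0.03, "CC": 0.01},
    "AU": {"CG": 0.04, "GC": 0.04, "AU": 0.90, "CC": 0.02},
    "CC": {"CG": 0.30, "GC": 0.30, "AU": 0.20, "CC": 0.20},
}
filled = add_pseudocounts({"CG": 7}, toy_sub)
for pair, value in filled.items():
    print(pair, round(value, 3))
```

After filling, every pair has a non-zero count, so log(obs/expected) is finite and graded rather than a single arbitrary floor.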
40RNA substitution matrices
Obtained from a eukaryote + archaea + bacteria 16S/18S rRNA alignment
Base-pair substitution matrix (16 x 16):
     AA       AT       AG       AC       TA       TT       TG       TC       GA       GT       GG       GC       CA       CT       CG       CC
AA   6.54e-04 5.20e-06 3.88e-05 4.22e-05 2.13e-05 5.51e-06 1.21e-05 3.84e-05 8.52e-05 1.28e-05 1.76e-04 2.89e-06 1.47e-05 6.47e-06 3.19e-06 4.69e-06
AT   7.96e-05 9.00e-04 5.19e-05 1.78e-04 1.69e-04 1.43e-04 8.85e-05 1.86e-04 4.15e-05 1.69e-04 1.22e-04 1.99e-04 8.73e-05 2.44e-04 1.25e-04 3.30e-04
AG   1.00e-04 8.72e-06 1.35e-03 1.27e-04 1.72e-05 5.09e-06 3.10e-05 1.38e-04 5.74e-05 1.59e-05 1.01e-04 8.22e-06 9.99e-06 1.62e-05 1.33e-05 2.56e-05
AC   4.11e-05 1.13e-05 4.81e-05 9.79e-04 2.79e-06 7.02e-06 2.79e-06 4.47e-05 4.93e-06 1.97e-05 3.05e-05 8.06e-06 5.40e-06 1.55e-05 2.47e-06 7.30e-05
TA   4.23e-04 2.19e-04 1.33e-04 5.69e-05 1.16e-03 2.21e-04 2.35e-04 2.78e-04 9.59e-05 1.18e-04 1.79e-04 1.08e-04 3.54e-04 2.04e-04 2.24e-04 9.28e-05
TT   1.05e-05 1.80e-05 3.80e-06 1.38e-05 2.14e-05 9.30e-04 2.57e-05 7.75e-05 5.79e-06 2.33e-05 4.87e-05 1.18e-05 1.57e-05 8.72e-05 1.83e-05 5.25e-04
TG   1.05e-04 5.03e-05 1.04e-04 2.49e-05 1.03e-04 1.16e-04 1.14e-03 1.80e-04 4.69e-05 4.56e-05 1.25e-04 4.26e-05 1.70e-04 2.15e-04 7.52e-05 3.23e-05
TC   1.45e-05 4.59e-06 2.03e-05 1.73e-05 5.30e-06 1.52e-05 7.82e-06 1.60e-04 4.55e-06 8.99e-06 4.77e-06 3.66e-06 9.00e-06 6.17e-05 2.95e-06 1.61e-05
GA   2.57e-04 8.19e-06 6.74e-05 1.53e-05 1.46e-05 9.11e-06 1.63e-05 3.64e-05 1.47e-03 2.50e-05 8.70e-05 2.12e-05 3.02e-05 2.83e-05 4.40e-06 8.02e-06
GT   1.24e-04 1.07e-04 6.02e-05 1.96e-04 5.81e-05 1.18e-04 5.10e-05 2.31e-04 8.04e-05 1.28e-03 9.39e-05 8.77e-05 2.53e-05 9.12e-05 3.55e-05 4.58e-05
GG   1.82e-04 8.24e-06 4.08e-05 3.24e-05 9.35e-06 2.61e-05 1.49e-05 1.30e-05 2.97e-05 9.98e-06 5.62e-04 6.96e-06 6.83e-06 8.80e-06 1.32e-05 1.06e-05
GC   1.14e-04 5.14e-04 1.26e-04 3.27e-04 2.16e-04 2.44e-04 1.94e-04 3.84e-04 2.78e-04 3.57e-04 2.67e-04 1.49e-03 1.07e-04 5.26e-04 2.57e-04 2.87e-04
CA   1.30e-05 5.04e-06 3.43e-06 4.90e-06 1.58e-05 7.22e-06 1.73e-05 2.10e-05 8.85e-06 2.30e-06 5.85e-06 2.40e-06 5.30e-04 5.30e-05 1.58e-05 7.68e-06
CT   3.86e-06 9.54e-06 3.78e-06 9.51e-06 6.16e-06 2.71e-05 1.48e-05 9.78e-05 5.60e-06 5.61e-06 5.10e-06 7.94e-06 3.58e-05 2.95e-04 5.22e-06 3.52e-05
CG   1.04e-04 2.68e-04 1.70e-04 8.32e-05 3.71e-04 3.12e-04 2.83e-04 2.56e-04 4.77e-05 1.19e-04 4.21e-04 2.12e-04 5.86e-04 2.86e-04 1.35e-03 2.50e-04
CC   2.13e-06 9.81e-06 4.54e-06 3.41e-05 2.12e-06 1.24e-04 1.69e-06 1.94e-05 1.21e-06 2.14e-06 4.70e-06 3.31e-06 3.95e-06 2.68e-05 3.48e-06 5.45e-04

Single-strand substitution matrix (4 x 4):
     A        T        G        C
A    9.13e-04 8.22e-05 1.05e-04 9.35e-05
T    5.57e-05 6.70e-04 7.98e-05 1.41e-04
G    6.94e-05 7.78e-05 7.32e-04 5.03e-05
C    4.09e-05 9.15e-05 3.33e-05 6.03e-04
41Recent development: E-values for RNA Motifs
Based on discrete convolution analysis of profiles
[Figure: simulated vs. computed score distributions]
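To illustrate what a discrete convolution over a profile can look like, the sketch below builds the exact distribution of the total score under a background model by convolving the per-column score distributions, then turns a tail probability into an E-value. The slide gives no further detail of ERPIN's procedure, so this is a generic construction under assumptions: columns are independent, scores are integers (real log-odds scores would first be discretized), and the toy profile and function names are invented.

```python
from collections import defaultdict

def score_distribution(profile, background):
    """Exact distribution of the summed profile score under the background.

    profile:    list of columns, each {symbol: integer score}
    background: {symbol: probability}
    """
    dist = {0: 1.0}
    for column in profile:
        nxt = defaultdict(float)
        # Convolve the running distribution with this column's distribution
        for total, p in dist.items():
            for symbol, score in column.items():
                nxt[total + score] += p * background[symbol]
        dist = dict(nxt)
    return dist

def e_value(dist, threshold, n_positions):
    """Expected number of background hits scoring at or above threshold."""
    p_tail = sum(p for score, p in dist.items() if score >= threshold)
    return n_positions * p_tail

# Toy 3-column profile favoring A (invented), uniform background
background = {base: 0.25 for base in "ACGU"}
profile = [{"A": 2, "C": -1, "G": -1, "U": -1}] * 3
dist = score_distribution(profile, background)
print(e_value(dist, threshold=6, n_positions=6400))
```

Because the distribution is computed rather than sampled, very small tail probabilities are available without running long simulations.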
42The ERPIN Server
http://tagc.univ-mrs.fr/erpin/
All searches parameterized to scan a bacterial
genome in less than 5 minutes
43Micro-RNA Search
- 18 training sets built for 18 miRNA families
- Using CLUSTALW + Alifold
- 10 sequences on average
Legendre, Lambert, Gautheret, Bioinformatics 2004
44ERPIN vs WU-BLAST
- 20 animal genomes scanned
- Sensitive WU-BLAST parameters (W = 7)
- E-value ≤ 0.01
[Figure: WU-BLAST vs. ERPIN hit counts; surviving label: 43 (0)]
45Analysis of a miR Cluster
[Figure: miR17 cluster across species, including Ciona]
Grey: initial training set. E indicates hits identified by ERPIN only; EB indicates hits identified by both ERPIN and BLAST.
- Important homologues missed by WU-BLAST
- Profile search a must in miRNA detection
46TAGC lab RNA Bioinformatics
- Takeshi Ara
- Fabrice Lopez
- Matthieu Legendre
- William Ritchie
- Daniel Gautheret