Diversity of UTR Regions - PowerPoint PPT Presentation

1
Diversity of UTR Regions
  • INSERM TAGC, Marseille

Alternate Transcript Diversity
2
Part I: EST-based analysis of polyadenylation and UTRs
3
The PolyA Site (PAS)
[Figure: end of a transcript; labels: stop, 3' exon, UTR, PolyA signal (AATAAA), 17 nt, PAS, polyA tail (AAAA)]
4
Alternative PolyA Sites
From Edwalds-Gilbert et al., NAR 25, 2547 (1997)
5
Alternative PAS: Post-transcriptional (de)regulation
[Figure: coding sequence followed by a 3' UTR carrying several AUUAAA signals; labels: possible regulatory element (stability, translation, transport)]
  • Use of an abnormal polyA site is associated with
    various diseases
  • α/β thalassemia (globin)
  • Mantle cell lymphoma (Cyclin CCND1)
  • Teratocarcinoma (PDGF)
  • Hypertension (Ca2+ ATPase)

6
PAS Discovery through EST/mRNA Alignment
[Figure: 5' ESTs and 3' ESTs aligned to an mRNA or EST contig]
First observation in 1998: 189 cases of
alternative polyadenylation
Gautheret et al. (1998) Genome Res. 8, 524
7
2000: 1000 Genes Observed w/ Alt PAS
(estimate: at least 22% of genes)
Beaudoing et al. (2000) Genome Res. 10, 1001
8
EST Counts as a Measure of Signal Efficiency
[Figure: four bar charts of ESTs per site (scale 0 to 6) for genes with 1, 2, 3, and 4 sites; legend: AAUAAA, AUUAAA, other 1-base variants, others]
  • Distal site more efficient than proximal
  • AAUAAA signal more efficient than variant signal
9
Tissue-specific sites
EST counts by tissue and site:
            Site 450   Site 2700
  Bone          2         10
  Others       49         11
Fisher's exact test: P = 0.00003
1942 biases in 951 different human 3' UTRs
Beaudoing & Gautheret (2001) Genome Res. 9, 1520
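
The P value on this slide can be checked from the four counts that appear on it (Bone: 2 and 10 ESTs; Others: 49 and 11 ESTs at the two sites). The sketch below is a plain two-sided Fisher's exact test built on the hypergeometric distribution; it is not the authors' code, and reading the counts as a 2x2 table is an assumption based on the slide layout.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Counts read off the slide (assumed layout: rows = tissue, columns = site)
p_value = fisher_exact_two_sided(2, 10, 49, 11)
print(f"{p_value:.1e}")
```

With these counts the test gives a value close to 3e-5, in line with the figure quoted on the slide.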
10
Improved Definition of PAS signals
  • How does the polyA machinery tell a true
    cleavage site from a random AATAAA?
  • What other signals help dictate use of specific
    sites in certain conditions?

Upstream Sequence Element (USE): enhancer element; mostly found in viral sequences
Downstream Sequence Element (DSE): constitutive; poorly defined; mutations tolerated
11
Analysis of PAS-flanking regions
Genomic
12
Most Frequent Hexamers
13
The DSE is a U-rich Region
[Figure: nucleotide frequencies by position (0 = polyA signal)]
Legendre & Gautheret, BMC Genomics (2003)
14
USE, DSE and Polyadenylation Efficiency
[Figure: U content by position (0 = polyA signal), strong sites vs. weak sites]
USE, paired t-test (weak vs. strong): t = -0.1826, df = 35, p-value = 0.8562
DSE, paired t-test (weak vs. strong): t = -3.5876, df = 35, p-value = 0.001010
15
EST-based PAS Map (2005)
  • 3' ESTs mapped directly to the genome
  • 67,000 polyA sites identified
  • 28,000 sites within 2 kb of an Ensembl gene
  • 29,000 sites not within 10 kb of an Ensembl gene
  • 57% of genes have 2 or more polyA sites
  • (was 22% in 2000!)

16
The ATD Project
  • Integrate Splice + PolyA + Init variants
  • Quality control
  • Tissue-specific Isoforms
  • Regulatory motifs
  • Isoform specific oligos
  • RT-PCR validation of selected isoforms

The ATD project is funded by the European
Commission within its FP6 Programme, under the
thematic area "Life sciences, genomics and
biotechnology for health", contract number
LHSG-CT-2003-503329
17
Revisiting UTR Length
18
What is the actual reach of the 3' UTR?
  • Textbook: "Human Molecular Genetics 2" (1999)
  • 3' UTR: "Average of about 0.6 kb (see Zhang, 1998)
    but this is likely to be an underestimate because
    of underreporting of genes with long 3' UTRs"
  • "Untranslated Regions of mRNA" (Mignone et al.
    2003)

19
Many recent papers mention distal PAS
  • All rely on EST sampling, but
  • Require alignment to a RefSeq gene/full-length cDNA or
    overlapping ESTs
  • Cannot assess all long-range PAS

20
How can you make sure a PAS pertains to the
nearest 5' gene?
  • In the absence of overlapping ESTs: danger!
  • There could be another short gene in the interval
  • PAS could be just noise (remember: 29,000 PAS are
    >10 kb from any Ensembl gene)
  • => We need a gauge to evaluate PAS reality

21
Gauge signal usage
[Figure: ratio of AAUAAA to all 11 signals vs. distance from STOP, for mouse and human; 15 kb marked]
Background is not only FP!
  • Noisy PAS are expected to use random polyA
    signals
  • Not dependent on EST coverage
  • True PAS appear dominant up to 15kb!
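
The gauge described on this slide can be sketched as follows: bin predicted sites by distance from the stop codon and, in each bin, take the share of sites that use the canonical AAUAAA signal. The function name, the 1 kb bin size, and the input format (distance, signal hexamer) are illustrative assumptions, not the authors' pipeline.

```python
from collections import defaultdict

def canonical_ratio_by_distance(sites, bin_size=1000):
    """Fraction of sites using AAUAAA, per distance bin.

    sites: iterable of (distance_from_stop_in_nt, signal_hexamer) pairs
           (hypothetical input format).
    Returns {bin_start: ratio}, bins ordered by distance.
    """
    totals = defaultdict(int)
    canonical = defaultdict(int)
    for distance, signal in sites:
        bin_start = (distance // bin_size) * bin_size
        totals[bin_start] += 1
        if signal == "AAUAAA":
            canonical[bin_start] += 1
    return {b: canonical[b] / totals[b] for b in sorted(totals)}

# Made-up example sites
example = [(200, "AAUAAA"), (800, "AUUAAA"),
           (1500, "AAUAAA"), (1700, "AAUAAA"),
           (20000, "CAUAAA"), (20500, "AAUAAA")]
print(canonical_ratio_by_distance(example))
```

A ratio that stays well above what random hexamer usage would give suggests that sites in that distance range are mostly real, independently of how many ESTs cover them.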

22
Direct UTR counts
[Figure: number of sites vs. distance from STOP; 15 kb marked]
23
Integrate
  • Twice the length of the 3' UTR in Ensembl, RefSeq,
    full-length cDNAs
  • At least 4000 human genes have, in their longest
    form, a 3' UTR longer than 3 kb.

24
Intergenic polyA
  • About 50% of predicted polyA sites fall in
    intergenic regions (>15 kb from Stop)
  • Consistent with recent tiling microarray data
    (Rosetta etc.)
  • We estimate that 75% of our intergenic polyA
    sites are true

25
UTR length
  • Two independent measures converge towards
    significant numbers of UTRs at least up to 15 kb
  • Ensembl/RefSeq average 3' UTR (longest non-zero
    UTR): 1 kb
  • Actual value: 2.4 kb. Each Ensembl/RefSeq
    gene thus lacks on average 1.5 kb of its UTR!
  • Chicken is shorter, but poorly sampled (not shown)
  • Mostly due to alternative polyadenylation
  • Search space for regulatory motifs, miRNA targets,
    etc. is doubled (additional 22 Mb)

26
Selected PAS or transcriptional leakage?
  • 3' UTR sizes from orthologues (Ensembl).
  • Chi2 probability: 0!
  • Long UTR in human => long UTR in mouse
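
A chi-square test of this kind can be sketched on a 2x2 table that cross-classifies orthologous gene pairs by UTR length class (short or long) in human and in mouse. The slide does not give the counts or the class boundaries, so the table below is invented for illustration only; the p-value formula is the exact tail of the chi-square distribution with one degree of freedom.

```python
import math

def chi2_2x2(table):
    """Chi-square test of independence for a 2x2 table [[a, b], [c, d]].

    Returns (chi2 statistic, p-value for 1 degree of freedom).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    chi2 = 0.0
    for i, row_sum in enumerate(row_sums):
        for j, col_sum in enumerate(col_sums):
            expected = row_sum * col_sum / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows = human (short, long), columns = mouse (short, long)
chi2, p = chi2_2x2([[30, 10], [10, 30]])
print(chi2, p)
```

A strong excess on the diagonal (short with short, long with long) is what drives the statistic up and the probability towards zero.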

27
Conservation of multiple polyadenylation
  • Number of PAS in orthologous genes
  • Chi2 P-value < 10^-30

[Figure: number of PAS per gene, mouse vs. human]
28
Conservation function
  • Alternative polyadenylation appears to be
    regulated
  • Increased importance of UTR extension as a target
    for post-transcriptional regulation

29
Alternative PAS Conservation across Species
Identifying Regulated Alternative PAS
(ongoing work)
30
What is a Conserved PAS?
[Figure: PAS sites in human, mouse, and rat orthologs, classified as Specific, Partially Conserved, or Conserved]
Detect and Classify
31
Topology Alone is Ambiguous
[Figure: human and mouse PAS whose correspondence is unclear (marked "?")]
Use sequence conservation
32
Best species for studying conserved alt PAS
(another reason to like chicken)
  • 5 species used
  • human
  • chimpanzee
  • mouse
  • rat
  • chicken

From Margulies et al. Genome Res. 2003
33
Criteria for Conserved PAS Detection
Scan (window = 6 bp, shift = 1 bp)
[Figure: conserved block containing the polyA signal, with a 25 bp flanking region on each side (at least one flanking region required)]
The polyA signal should be conserved in N species and
the flanking region should have >65% identity over N species.
N = 2, 3, 4, 5
More stringent than the usual criteria for
identifying selective pressure
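
The scan described on this slide can be sketched as below, working on a gap-padded multiple alignment given as one string per species. Several details are assumptions made for illustration and are not stated in the slides: only AATAAA and ATTAAA count as signals, "conserved" means every species shows one of them in the same window, and identity is the fraction of flank columns that are identical in all species.

```python
SIGNALS = {"AATAAA", "ATTAAA"}  # assumption: signals accepted in every species
WINDOW = 6                      # window size from the slide
FLANK = 25                      # flanking region length from the slide
MIN_IDENTITY = 0.65             # identity threshold from the slide

def block_identity(block):
    """Fraction of columns identical (and gap-free) across all sequences."""
    columns = list(zip(*block))
    if len(columns) < FLANK:    # incomplete flank near the alignment edge
        return 0.0
    identical = sum(1 for col in columns if "-" not in col and len(set(col)) == 1)
    return identical / len(columns)

def conserved_pas(alignment):
    """Return start columns of polyA signals conserved across the alignment."""
    hits = []
    length = min(len(seq) for seq in alignment)
    for start in range(length - WINDOW + 1):           # shift = 1
        if not all(seq[start:start + WINDOW] in SIGNALS for seq in alignment):
            continue
        upstream = [seq[max(0, start - FLANK):start] for seq in alignment]
        downstream = [seq[start + WINDOW:start + WINDOW + FLANK] for seq in alignment]
        # At least one flanking region must pass the identity threshold
        if (block_identity(upstream) > MIN_IDENTITY
                or block_identity(downstream) > MIN_IDENTITY):
            hits.append(start)
    return hits

# Toy 3-species alignment: identical upstream flank, divergent downstream flank
up = "ACGTACGTACGTACGTACGTACGTA"
toy = [up + "AATAAA" + "G" * 25,
       up + "ATTAAA" + "C" * 25,
       up + "AATAAA" + "T" * 25]
print(conserved_pas(toy))
```

Passing alignments of 2, 3, 4, or 5 species corresponds to the N = 2, 3, 4, 5 settings on the slide.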
34
Conserved PAS candidates Supported by EST Mapping
[Figure: % of conserved PAS covered by ESTs (axis 0 to 40) vs. N-species conservation: no cons., N = 2, 3, 4, 5]
30% of conserved PAS are supported by ESTs
  • We should require at least 3-species conservation
    to consider a PAS as conserved

35
A Significant Fraction of Genes has Conserved PAS
  • Over 22000 annotated human genes
  • >20% have a putative CONPAS (at least 3-species)
  • 10% have multiple putative CONPAS
  • 7% have a putative CONPAS supported by ESTs
  • Compares to 10-15% conserved alt-splice variants
  • Suggests selective pressure on many polyA site
    sequences in animal genomes

36
Why should PAS be embedded in conserved sequences?
  • Regulatory protein binding site?
  • Regulatory RNA structure?
  • Antisense or miRNA targets?

Probably not
In some cases
Probably most cases
37
Part II: Erpin News
38
ERPIN Profile-based RNA Motif Search
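[Figure: a training set is converted into a helix profile (16 x N; rows AA, GA, CA, UA, AG, GG, CG, ..., UG, UU) and a single-strand profile (5 x N)]
Score for a base pair (b1, b2): $S_{b_1 b_2} = \log\left(\frac{F_{b_1 b_2}}{F_{b_1} \times F_{b_2}}\right)$
Search algorithm combines dynamic programming for
single strands and profile search for helices
Gautheret & Lambert, JMB, 2001, 313, p. 1005.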
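
The helix score on this slide, log(F_b1b2 / (F_b1 x F_b2)), can be illustrated with a small sketch that builds a 16 x N profile from the base pairs observed in a training set. This is not ERPIN's code: the function name and input format are invented, and F_b1 and F_b2 are taken to be background single-base frequencies, which is an assumption.

```python
import math
from collections import Counter

BASES = "ACGU"

def helix_lod_profile(pairs_by_column, base_freq):
    """Build one dict of 16 log-odds scores per helix position.

    pairs_by_column: list over helix positions; each entry lists the
                     (b1, b2) base pairs seen at that position in the
                     training set (hypothetical input format).
    base_freq: background frequency of each base (assumed meaning of F_b).
    """
    profile = []
    for column in pairs_by_column:
        counts = Counter(column)
        scores = {}
        for b1 in BASES:
            for b2 in BASES:
                observed = counts[(b1, b2)] / len(column)
                expected = base_freq[b1] * base_freq[b2]
                # Pairs never seen in the training set get minus infinity here
                scores[b1 + b2] = (math.log(observed / expected)
                                   if observed > 0 else float("-inf"))
        profile.append(scores)
    return profile

uniform = {b: 0.25 for b in BASES}
profile = helix_lod_profile([[("G", "C")] * 4], uniform)
print(profile[0]["GC"], profile[0]["AU"])
```

In this raw form every pair absent from the training column is scored as impossible, however plausible it is as a base pair.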
39
Recent development: pseudocounts
A Mir-133 training-set
(( - ((((((( ------ ((( - (((( ---------------------- )))) ))) ------ ))))))) - ))
TC t GGCTGGT caaac- GGA a CCAA gtccgtcttcctgagaggt--- TTGG TCC CCTTCA ACCAGCT a CA
TG t GGCTGGT caaac- GGA a CCAA gtcaggtgtttctgtgaggt-- TTGG TCC CCTTCA ACCAGAC t AT
TG t GGCTGGT aaaac- GGA a CCAA gtcaggtgtttttgtgaggt-- TTGG TCC CCTTCA ACCAGCT a TG
TG c GGCTGGT gaaaa- GGA a CCAC atcaacccagaaaaaggat--- TTGG TCC CCTTCA ACCAGCC g CA
TA t GGCTGGT caaac- GGA a CCAA gtccgtcttccttagaggt--- TTGG TCC CCTTCA ACCAGCT a TT
AG t TGCTGGT aaaac- GGA a CCAA gtcgggtgtttgcgagaggt-- TTGG TCC CTTTCA ACCAGCT a CT
TG t GGCTGGT caaat- GGA a CCAA gtcaggtgtttctgcgaggt-- TTGG TCC CCTTCA ACCAGCT a CT
A column with 100% CG: all other scores, log(obs/expected), get an
arbitrary low value! What about GC or AU in
this column? Is it as bad as CC or AG?
Answer: fill columns with expected counts, based
on a reasonable model: pseudocounts. Requires RNA
bp and ss substitution matrices
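
One way to turn a substitution matrix into pseudocounts is sketched below for a single-strand column. It uses the 4 x 4 values listed on the "RNA substitution matrices" slide, but the recipe itself (normalising each row to a conditional probability, then mixing the expected counts with the observed ones using a weight) is a generic textbook approach chosen for illustration; the slides do not give ERPIN's actual formula, and the function name and `weight` parameter are hypothetical.

```python
import math

BASES = "ATGC"
# 4 x 4 values from the "RNA substitution matrices" slide
# (rows and columns in the order A, T, G, C)
SS_MATRIX = [
    [9.13e-04, 8.22e-05, 1.05e-04, 9.35e-05],
    [5.57e-05, 6.70e-04, 7.98e-05, 1.41e-04],
    [6.94e-05, 7.78e-05, 7.32e-04, 5.03e-05],
    [4.09e-05, 9.15e-05, 3.33e-05, 6.03e-04],
]

def column_scores(column, background, weight=1.0):
    """Log-odds scores for one single-strand column, with pseudocounts.

    column: string of bases observed at this position in the training set.
    background: list of background frequencies in BASES order.
    weight: total pseudocount mass added to the column (assumption).
    """
    counts = [column.count(b) for b in BASES]
    total = sum(counts)
    # Row-normalise the matrix to get P(b | a) (assumed interpretation)
    cond = [[v / sum(row) for v in row] for row in SS_MATRIX]
    # Expected share of each base given the bases actually observed
    pseudo = [sum(counts[a] / total * cond[a][b] for a in range(4))
              for b in range(4)]
    freqs = [(counts[b] + weight * pseudo[b]) / (total + weight)
             for b in range(4)]
    return {BASES[b]: math.log(freqs[b] / background[b]) for b in range(4)}

scores = column_scores("AAAAAAA", [0.25] * 4)
print(scores)
```

With pseudocounts, unseen bases get finite scores that differ from one another: a base that often replaces the observed one is penalised less than one that rarely does.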
40
RNA substitution matrices
Obtained from euk+archae+bac 16S/18S rRNA
alignment
16 x 16 matrix:
    AA        AT        AG        AC        TA        TT        TG        TC        GA        GT        GG        GC        CA        CT        CG        CC
AA  6.54e-04  5.20e-06  3.88e-05  4.22e-05  2.13e-05  5.51e-06  1.21e-05  3.84e-05  8.52e-05  1.28e-05  1.76e-04  2.89e-06  1.47e-05  6.47e-06  3.19e-06  4.69e-06
AT  7.96e-05  9.00e-04  5.19e-05  1.78e-04  1.69e-04  1.43e-04  8.85e-05  1.86e-04  4.15e-05  1.69e-04  1.22e-04  1.99e-04  8.73e-05  2.44e-04  1.25e-04  3.30e-04
AG  1.00e-04  8.72e-06  1.35e-03  1.27e-04  1.72e-05  5.09e-06  3.10e-05  1.38e-04  5.74e-05  1.59e-05  1.01e-04  8.22e-06  9.99e-06  1.62e-05  1.33e-05  2.56e-05
AC  4.11e-05  1.13e-05  4.81e-05  9.79e-04  2.79e-06  7.02e-06  2.79e-06  4.47e-05  4.93e-06  1.97e-05  3.05e-05  8.06e-06  5.40e-06  1.55e-05  2.47e-06  7.30e-05
TA  4.23e-04  2.19e-04  1.33e-04  5.69e-05  1.16e-03  2.21e-04  2.35e-04  2.78e-04  9.59e-05  1.18e-04  1.79e-04  1.08e-04  3.54e-04  2.04e-04  2.24e-04  9.28e-05
TT  1.05e-05  1.80e-05  3.80e-06  1.38e-05  2.14e-05  9.30e-04  2.57e-05  7.75e-05  5.79e-06  2.33e-05  4.87e-05  1.18e-05  1.57e-05  8.72e-05  1.83e-05  5.25e-04
TG  1.05e-04  5.03e-05  1.04e-04  2.49e-05  1.03e-04  1.16e-04  1.14e-03  1.80e-04  4.69e-05  4.56e-05  1.25e-04  4.26e-05  1.70e-04  2.15e-04  7.52e-05  3.23e-05
TC  1.45e-05  4.59e-06  2.03e-05  1.73e-05  5.30e-06  1.52e-05  7.82e-06  1.60e-04  4.55e-06  8.99e-06  4.77e-06  3.66e-06  9.00e-06  6.17e-05  2.95e-06  1.61e-05
GA  2.57e-04  8.19e-06  6.74e-05  1.53e-05  1.46e-05  9.11e-06  1.63e-05  3.64e-05  1.47e-03  2.50e-05  8.70e-05  2.12e-05  3.02e-05  2.83e-05  4.40e-06  8.02e-06
GT  1.24e-04  1.07e-04  6.02e-05  1.96e-04  5.81e-05  1.18e-04  5.10e-05  2.31e-04  8.04e-05  1.28e-03  9.39e-05  8.77e-05  2.53e-05  9.12e-05  3.55e-05  4.58e-05
GG  1.82e-04  8.24e-06  4.08e-05  3.24e-05  9.35e-06  2.61e-05  1.49e-05  1.30e-05  2.97e-05  9.98e-06  5.62e-04  6.96e-06  6.83e-06  8.80e-06  1.32e-05  1.06e-05
GC  1.14e-04  5.14e-04  1.26e-04  3.27e-04  2.16e-04  2.44e-04  1.94e-04  3.84e-04  2.78e-04  3.57e-04  2.67e-04  1.49e-03  1.07e-04  5.26e-04  2.57e-04  2.87e-04
CA  1.30e-05  5.04e-06  3.43e-06  4.90e-06  1.58e-05  7.22e-06  1.73e-05  2.10e-05  8.85e-06  2.30e-06  5.85e-06  2.40e-06  5.30e-04  5.30e-05  1.58e-05  7.68e-06
CT  3.86e-06  9.54e-06  3.78e-06  9.51e-06  6.16e-06  2.71e-05  1.48e-05  9.78e-05  5.60e-06  5.61e-06  5.10e-06  7.94e-06  3.58e-05  2.95e-04  5.22e-06  3.52e-05
CG  1.04e-04  2.68e-04  1.70e-04  8.32e-05  3.71e-04  3.12e-04  2.83e-04  2.56e-04  4.77e-05  1.19e-04  4.21e-04  2.12e-04  5.86e-04  2.86e-04  1.35e-03  2.50e-04
CC  2.13e-06  9.81e-06  4.54e-06  3.41e-05  2.12e-06  1.24e-04  1.69e-06  1.94e-05  1.21e-06  2.14e-06  4.70e-06  3.31e-06  3.95e-06  2.68e-05  3.48e-06  5.45e-04
4 x 4 matrix:
   A         T         G         C
A  9.13e-04  8.22e-05  1.05e-04  9.35e-05
T  5.57e-05  6.70e-04  7.98e-05  1.41e-04
G  6.94e-05  7.78e-05  7.32e-04  5.03e-05
C  4.09e-05  9.15e-05  3.33e-05  6.03e-04
41
Recent development: E-values for RNA Motifs
Based on discrete convolution analysis of profiles
[Figure: simulated vs. computed score distributions]
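
The idea of obtaining a score distribution by discrete convolution can be sketched as follows. Each profile column contributes a distribution over (discretised) scores under a background model; convolving the columns gives the distribution of the total score, and the tail above a threshold, multiplied by the number of positions searched, gives an E-value. The function names, the integer scores, and the assumption that columns are independent are illustrative choices, not ERPIN's implementation.

```python
from collections import defaultdict

def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete scores."""
    out = defaultdict(float)
    for score_a, prob_a in dist_a.items():
        for score_b, prob_b in dist_b.items():
            out[score_a + score_b] += prob_a * prob_b
    return dict(out)

def total_score_distribution(columns):
    """Convolve per-column {score: probability} dicts into one distribution."""
    total = {0: 1.0}
    for column in columns:
        total = convolve(total, column)
    return total

def e_value(columns, threshold, n_positions):
    """Expected number of background hits scoring at least `threshold`."""
    dist = total_score_distribution(columns)
    tail = sum(prob for score, prob in dist.items() if score >= threshold)
    return n_positions * tail

# Two made-up columns: score +2 with probability 0.25, -1 otherwise
columns = [{2: 0.25, -1: 0.75}, {2: 0.25, -1: 0.75}]
print(e_value(columns, threshold=4, n_positions=1000))  # 62.5
```

Only the top score of 4 (probability 0.25 x 0.25) reaches the threshold here, hence 1000 x 0.0625 = 62.5 expected background hits.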
42
The ERPIN Server
http://tagc.univ-mrs.fr/erpin/
All searches parameterized to scan a bacterial
genome in less than 5 minutes
43
Micro-RNA Search
  • 18 training sets built for 18 miRNA families
  • Using CLUSTALW + Alifold
  • 10 sequences on average

Legendre, Lambert, Gautheret, Bioinformatics 2004
44
ERPIN vs WU-BLAST
  • 20 animal genomes scanned
  • Sensitive WU-BLAST parameters (W = 7)
  • E-value ≤ 0.01

WU-BLAST
ERPIN
43 (0)
45
Analysis of a miR Cluster
[Figure: miR17 cluster across species, including ciona]
Grey: initial training set. E indicates hits
identified by ERPIN only; EB indicates hits
identified by both ERPIN and BLAST.
  • Important homologues missed by WU-BLAST
  • Profile search a must in miRNA detection

46
TAGC lab RNA Bioinformatics
  • Takeshi Ara
  • Fabrice Lopez
  • Matthieu Legendre
  • William Ritchie
  • Daniel Gautheret