Title: Exploring the transcriptome
1Exploring the transcriptome
- From transcripts to genes
2Transcriptome sequencing
- Traditional sequencing
  - cDNA clones isolated on the basis of some functional property of interest to a group
- EST sequencing
  - Large-scale sampling of end sequences of all cDNA clones present in a library
- Full-length sequencing
  - Systematic attempts to obtain high-quality sequences of cDNA clones representing all transcribed genes
3Why EST sequencing?
- Systematic sampling of the transcribed portion of the genome (transcriptome)
- Provides sequence tags allowing unique identification of genes (e.g. for SAGE)
- Provides experimental evidence for the positions of exons
- Provides regions coding for potentially new proteins
- Provides clones for DNA microarrays
4ESTs - the basics
- cDNA libraries prepared from various tissues and cell lines, using directional cloning
- Gridding of individual clones using robots
- For each clone, sequencing of both ends of the insert in a single pass
- Deposit readable part of sequence in database
5cDNA libraries
- Most are native, meaning that clone frequency reflects mRNA abundance
- Most are primed with oligo(dT), meaning that 3' ends are heavily represented
- The complexity of libraries is extremely variable
- Normalized libraries are used to enrich for rare mRNAs
6cDNA libraries used
- Large number of libraries represented
- Most libraries managed by the IMAGE consortium
- Human and mouse libraries most abundantly represented at IMAGE
- Systematic effort to make libraries from cancerous tissue: CGAP project (NCI)
- Many tissues still not sampled
- Quality very uneven
7Clone availability
- In principle, all clones produced by IMAGE are publicly available
- Distributors: ATCC, Incyte and Invitrogen in the US; HGMP (UK) and RZPD (Germany) in Europe
- Error rate is high: about 30% chance that a clone doesn't have the expected sequence
- Invitrogen sells sets of sequence-verified clones
8EST databases
- EMBL/GenBank have separate sections for EST sequences
- ESTs are the most abundant entries in the databases (> 60%)
- ESTs are now separated by division in the databases
  - EMBL: human, mouse, plant, prokaryote, etc.
- EST sequences are submitted in bulk, but do have to meet minimal quality criteria
9EST entries in EMBL (1)
ID   AI242177   standard; RNA; EST; 581 BP.
AC   AI242177;
SV   AI242177.1
DT   05-NOV-1998 (Rel. 57, Created)
DT   03-MAR-2000 (Rel. 63, Last updated, Version 3)
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3' similar to gb:M10988 TUMOR NECROSIS FACTOR
DE   PRECURSOR (HUMAN), mRNA sequence.
RN   1
RP   1-581
RA   NCI-CGAP
RT   National Cancer Institute, Cancer Genome Anatomy Project (CGAP), Tumor
RT   Gene Index http://www.ncbi.nlm.nih.gov/ncicgap
RL   Unpublished.
DR   RZPD; IMAGp998P154529; IMAGp998P154529.
CC   On May 19, 1998 this sequence version replaced gi:2846208.
CC   Contact: Robert Strausberg, Ph.D.
CC   Tel: (301) 496-1550
CC   Email: Robert_Strausberg_at_nih.gov
CC   This clone is available royalty-free through LLNL; contact the
CC   IMAGE Consortium (info_at_image.llnl.gov) for further information.
CC   Insert Length: 1280  Std Error: 0.00
CC   Seq primer: -40UP from Gibco
CC   High quality sequence stop: 463.
10EST entries in EMBL (2)
FH   Key             Location/Qualifiers
FH
FT   source          1..581
FT                   /db_xref="taxon:9606"
FT                   /db_xref="ESTLIB:452"
FT                   /db_xref="RZPD:IMAGp998P154529"
FT                   /note="Organ: Liver and Spleen; Vector: pT7T3D (Pharmacia)
FT                   with a modified polylinker; Site_1: Pac I; Site_2: Eco RI;
FT                   This is a subtracted version of the original Soares fetal
FT                   liver spleen 1NFLS library. 1st strand cDNA was primed
FT                   with a Pac I - oligo(dT) primer 5'
FT                   AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3',
FT                   double-stranded cDNA was ligated to Eco RI adaptors
FT                   (Pharmacia), digested with Pac I and cloned into the Pac I
FT                   and Eco RI sites of the modified pT7T3 vector. Library
FT                   went through one round of normalization. Library
FT                   constructed by Bento Soares and M.Fatima Bonaldo."
FT                   /sex="male"
FT                   /organism="Homo sapiens"
FT                   /clone="IMAGE:1851134"
FT                   /clone_lib="Soares_fetal_liver_spleen_1NFLS_S1"
FT                   /dev_stage="20 week-post conception fetus"
FT                   /lab_host="DH10B (ampicillin resistant)"
11EST sequence quality
- Single, unverified runs, so quality is on average low (and worsening towards the 3' end)
- Current submission rules require a documented Phred score > 20 (< 1% error)
- Trivial contaminants are common (vector, rRNA, mitochondrial RNA, other species)
- Frameshift errors are very common
- For many reads, trace files can be retrieved
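
The link between a Phred score and an error rate follows from the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. The short sketch below (function names are illustrative, not part of any submission tool) shows why a score of 20 corresponds to a 1% error rate.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Per-base error probability for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred quality score for a per-base error probability P."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:.4%}")
```

A read whose bases all score 20 or better is therefore expected to contain fewer than one error per 100 bases.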
12Private EST databases
- Producing and selling access to EST data has proven to be a lucrative business
- Incyte has created the LifeSeq databases, to which access can be bought
- Human Genome Sciences has preferred to exploit the data itself, and to get patents on promising genes found in its databases
13Why search EST databases?
- Alternative to library screening: a short tag can lead to a cDNA clone
- Alternative to cDNA sequencing: sequences of multiple ESTs can reconstitute a full-length cDNA
- Gene discovery: EST sequences often contain new members of known families
14BLAST searching EST databases
- Many sites where EST databases can be searched with BLAST, including our own
- Searching raw EST data is becoming less attractive, as for many species ESTs have been incorporated into transcript assemblies
- There are at present no "consensus" transcript assemblies
15Problems with raw EST databases
- The databases are highly redundant: e.g. 5 x 10^6 human sequences for 4 x 10^4 genes (about 125 ESTs per gene on average)
- The databases are skewed towards sequences near the 3' end of mRNAs
- The error rates are high in individual ESTs
- For most ESTs, there is no indication as to the gene from which it was derived
16The ORESTES project
- Goal: to obtain EST sequences from the under-represented, often coding, central portions of mRNAs
- Methodology: use low-stringency semi-random priming followed by PCR, producing low-complexity libraries
- Results: over 1,000,000 ESTs produced, of which half provide novel information
17How to organize EST collections?
- Clustering: associate individual EST sequences with unique transcripts or genes
- Assembling: derive consensus sequences from overlapping ESTs belonging to the same cluster
- Mapping: associate ESTs (or EST contigs) with exons in genomic sequences
- Interpreting: find and correct coding regions
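
As a rough illustration of the clustering step only, the sketch below groups ESTs that share at least one exact k-mer, using a union-find structure. This is a toy, not the procedure used by Unigene, TIGR or any other resource: the function names, the k-mer length and the example sequences are all made up, and real pipelines also handle reverse complements, sequencing errors and repeats.

```python
from collections import defaultdict


def kmers(seq, k):
    """Set of all k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=12):
    """Group ESTs (name -> sequence) that share at least one exact k-mer."""
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> name of the first EST that contained it
    for name, seq in ests.items():
        for km in kmers(seq.upper(), k):
            if km in first_seen:
                root_a, root_b = find(first_seen[km]), find(name)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_seen[km] = name

    clusters = defaultdict(list)
    for name in ests:
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    toy = {
        "est1": "ACGTACGTTAGCCGATAGGCTA",
        "est2": "TAGCCGATAGGCTAGGCTTAACG",   # overlaps the end of est1
        "est3": "TTTTGGGGCCCCAAAATTTTGGGG",  # unrelated
    }
    print(cluster_ests(toy))
```

Assembly would then build a consensus for each cluster, which this sketch does not attempt.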
18The Unigene database
- Unigene is an ongoing effort at NCBI to cluster EST sequences with traditional gene sequences
- For each cluster, there is a lot of additional information included
- Unigene is regularly rebuilt. Therefore, cluster identifiers are not stable gene indices
19The TIGR Gene Indices
- The Gene Indices are based on contig assemblies rather than true clustering strategies
- There are more Gene Indices than Unigene clusters
- TIGR maintains EGAD, a database of known human transcripts
- EGAD entries are included in the Human Gene Indices
20Other EST cluster and assembly databases
- SANBI in South Africa produces the STACK collection of human EST contigs
- MIPS in Munich and the SIB produce BLAST-searchable contigs from Unigene
- TIGEM in Italy has a nice collection of EST search and assembly tools, local and remote
- The CBIL at the U. of Pennsylvania has assembled the DOTS database, part of the AllGenes project
21Strategy for gene discovery
- Cluster EST sequences to identify candidate transcripts
- Assemble to increase length and reduce redundancy
- Find coding regions and reading frame, and correct frameshift errors
- Use deduced protein sequences as searchable database (TrEST)
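
To make the "find coding regions and reading frame" step concrete, here is a minimal sketch that translates a sequence in all six frames with the standard genetic code and reports the longest stretch without a stop codon. It is only an illustration under simplifying assumptions: it does not correct frameshifts, it does not require a start codon (ESTs are often partial), and it is not the method used to build TrEST. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_open_frame(seq):
    """Return (strand, offset, peptide) for the longest stop-free stretch."""
    seq = seq.upper()
    best = ("+", 0, "")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for segment in translate(s[offset:]).split("*"):
                if len(segment) > len(best[2]):
                    best = (strand, offset, segment)
    return best


if __name__ == "__main__":
    print(translate("ATGGCCAAATAG"))
    print(longest_open_frame("ATGGCCAAATAGGG"))
```

Because frameshift errors are common in single-pass reads, the true coding region of an EST will often be split across two frames; a real pipeline has to detect and join such pieces.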
22Interpreting ESTs in the context of the genome
- The genome sequence provides a reliable scaffold for clustering ESTs
- ESTs clustered on the genome can be assembled by following exon/intron boundaries
- The combination of genome, full-length insert and EST data will provide the reference dataset for gene and transcript annotation
23Using EST data for gene expression analysis
- Basic idea: compare relative abundance of specific cDNAs in libraries of different origins
- Best tools available: Digital Gene Expression Displayer (DGED) and xProfiler at the CGAP Web site
- Alternative: browse through genes unique to different libraries
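
The basic idea can be sketched as a 2x2 comparison: how many ESTs of a given gene were seen in each of two libraries, relative to the library sizes. The example below uses a two-sided Fisher exact test implemented from scratch; it is a generic illustration with invented counts, and is not meant to describe what DGED or xProfiler compute internally.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


def compare_libraries(count_a, size_a, count_b, size_b):
    """Relative abundance of one gene in two libraries, plus a p-value."""
    freq_a, freq_b = count_a / size_a, count_b / size_b
    p = fisher_exact_two_sided(count_a, size_a - count_a,
                               count_b, size_b - count_b)
    return freq_a, freq_b, p


if __name__ == "__main__":
    # Invented counts: 30 of 10,000 ESTs in library A, 5 of 12,000 in library B.
    print(compare_libraries(30, 10_000, 5, 12_000))
```

Such comparisons are only meaningful for native libraries; in normalized or subtracted libraries, clone frequency no longer reflects mRNA abundance.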
24Full-length cDNA sequencing
- Principle: make high-quality cDNA libraries, often optimized for full-length inserts
- Do first-pass sequencing of 3' and 5' ends, and select good clones: one per transcript, containing the initiation codon and poly(A)
- For selected clones, fully sequence the inserts
25Full-length sequencing projects
- MGC (at NCI): goal of 20,000 sequences a year starting in 2001, human and mouse; actually produced 15,000 human and 11,000 mouse
- RIKEN Genome Science Lab, Tsukuba: produced and annotated 60,000 full-length mouse sequences (FANTOM2)
- Kazusa Institute HUGE project: set of over 2,000 KIAA human cDNA sequences (large proteins)
- NEDO human cDNA project: over 20,000 FLJ sequences available in early 2003
- German cDNA consortium (DKFZ): produced over 3,200 sequences; current state of project unclear
26The future of ESTs
- In human and mouse, most will come as byproducts of full-length projects
- There are good arguments for trying to reach saturation on selected tissues
- ESTs are still the tool of choice for rapid exploration of the transcriptomes of various species, especially those with large genomes
- ESTs could form a very solid basis for evolutionary studies
27The need for a gene index
- All high-throughput biology methods require a unique and reliable way to describe the genes they are analysing
- This index should be stable, unique, extensible, and independent of a system of nomenclature
- The index should document all transcript sequences belonging to the corresponding gene
28Some commonly used indices
- EMBL/GenBank/DDBJ accession numbers
  - Unique and universally accepted, BUT
  - Highly redundant (many entries per gene)
- Unigene cluster identifiers (NCBI)
  - Widely used and non-redundant, BUT
  - Rely on clustering procedure (unreliable), AND
  - Unstable: clusters change with each build
- RefSeq accession numbers (NCBI)
  - Stable and non-redundant, BUT
  - Still very far from comprehensive, AND
  - Many RefSeq sequences are incomplete, AND
  - Splice variants are not systematically documented
29Emerging gene indices
- HUGO Gene Nomenclature Committee
  - Assigns unique symbols to human genes with a known or putative function
  - About 15,000 symbols assigned so far
  - No attempt to link to sequence data or variants
- Ensembl
  - Uses a conservative pipeline to predict genes and transcripts
  - 25,000 predicted genes and 37,000 transcripts so far
  - Provides unique identifiers for genes and transcripts
  - Linked to HGNC names when possible
30Gene documentation resources
- Repositories collecting info about loci from multiple sources
- LocusLink (NCBI): links mostly to NCBI internal information sources
- GeneCards (Weizmann Inst.): uses Web robots to collect information, probably most complete resource available today
- GeneLynx (Karolinska Inst.): similar to GeneCards, but also collects EST information
31 tromer goals
- To combine the draft genome and transcriptome into a database documenting all human transcripts
- To design a rational representation and indexing system for human genes
- To reconstitute transcripts and their variants using genomic DNA as template
32Methods
- Create a comprehensive and non-redundant set of mRNA 3' ends, documented by polyadenylation sites mapped to the genome
- Create a comprehensive mapping of EST sequences to exons on the genome
- Use this to create a database of exons and introns, defined by their coordinates on the genome
- Combine these two data sets to reconstitute transcripts
33Poly(A) site documentation
- Re-extract EST sequences from trace files (when available)
- Extract 50 nt adjacent to poly(A)
- Make this collection non-redundant
- Map all tags to the draft genome
- Validate the quality of the tags
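
The tag extraction and redundancy-removal steps might look roughly like the sketch below. It assumes the EST is already in mRNA sense and ends with its poly(A) tail (3' reads as deposited usually start with poly(T) and would first need reverse-complementing). The minimum tail length of 10 and the function names are illustrative assumptions, not the parameters actually used.

```python
import re


def extract_3prime_tag(est, tag_length=50, min_poly_a=10):
    """Return the tag_length nt immediately upstream of a terminal poly(A) run.

    Returns None when there is no terminal run of at least min_poly_a As,
    or when fewer than tag_length nt precede it.
    """
    seq = est.upper()
    match = re.search(r"A{%d,}$" % min_poly_a, seq)
    if match is None:
        return None
    upstream = seq[:match.start()]
    if len(upstream) < tag_length:
        return None
    return upstream[-tag_length:]


def non_redundant_tags(ests):
    """Collect the distinct tags found in a collection of EST sequences."""
    tags = (extract_3prime_tag(est) for est in ests)
    return sorted({tag for tag in tags if tag is not None})


if __name__ == "__main__":
    read = "C" * 10 + "ACGTG" * 10 + "A" * 15
    print(extract_3prime_tag(read))
    print(len(non_redundant_tags([read, read, "ACGT" * 20])))
```

Mapping the tags to the genome is what later allows validation: a tag followed by an A-rich stretch in the genomic sequence suggests internal priming rather than a genuine polyadenylation site.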
34Generation of the 3' tags
35Example of a "bad" tag
ID   3P016851
SQ   1 0 AGCACTGATGTTGAATTTCACTTTA
     AGTTGTCACATGGTCTGAGTTGTACaaaaagaaagaaaaattctggccgg
     gtgtggtggctcatgcctgtaatcc
PA   Score -315  match negative
QQ   A13 C0 G2 T0  test bad
CM   nz76g10.s1 AA764893 -, 18 As in polyA
CO   qt57h10.x1 AI355333
GN   AP000848_32 pos 981 CHR 11
GN   AP001320_19 pos 2151 CHR 11
GN   AC011875_21 pos 1166 CHR NA
//
Example of a "good" tag
ID   3P016857
SQ   1 0 AGCACTGTCTTATCACATCGCCAAT
     TAGTTGTAATAAACGTTCAACGTACaaacactggagtgtgtttttatctc
     tttccaaaagtttgtcaaactatgc
PA   Score 117  match strong
QQ   A5 C2 G5 T3  test ok
CM   nk75b07.s1 AA551090 -, 17 As in polyA
CO   ob29c06.s1 AA741544  qm22a02.x1 AI278652  wj07f01.x1 AI824173  wm06d04.x1 AI887507
CO   xf19g12.x1 AW131962  yw01c11.s1 H96551  yy23c08.s1 N35379  zl39e04.s1 AA149587
CO   zq01c05.s1 AA193118  zv54a07.s1 AA437212
GN   AC004106_1 - pos 34804 CHR X
SQ   2 4 CTGTCTTATCACATCGCCAATTAGTTGTAATAAACGTTCAACGTACAAAC
     actggagtgtgtttttatctctttccaaaagtttgtcaaactatgcagag
PA   Score 134  match strong
QQ   A2 C1 G5 T7  test ok
CM   aa28a05.r1 AA480914 -, 16 As in polyA
CO   qe01d07.x1 AI143977  qu50g08.x1 AI285755  yh71a05.s1 R32806  yv79c01.r1 H82355
CO   zn91a12.s1 AA127331  zn92a12.s1 AA127424
GN   AC004106_1 - pos 34800 CHR X
//
36Exon and intron mapping
- Document all matches between genome contigs and RNAs (pairwise matching) using BLAST (> 5 x 10^6 matches)
- Create pairwise alignments between RNAs and genome contigs using sim4, document exon and intron boundaries
- Import data into ACEDB for exploration and visualization
37Output of sim4 program
>chr AC018734_9 AC018734.3 Chromosome 21 Clone RP11-351D2 SC WashU. Homo sapiens LEN 30562
>est AW468733 AW468733.1- LEN 413

16806-16850  (1-45)   97% ->
18234-18383  (46-195)   100% ->
19199-19266  (196-263)   100% ->
19463-19612  (264-413)   100%

  16806 CCCGTCGGGAGACACGTTTTGTCAGTTGCTTCCTCTGGAGTCCTTGTG..
                                                     >>>..
      1 CCNGTCGGGAGACACGTTTTGTCAGTTGCTTCCTCTGGAGTCCTT

  16856 .AAGTTAAAAGACTACTTTTTAGAGCAGTTTTCATTTTACCGATACATCG
        .>>>
     46     TTAAAAGACTACTTTTTAGAGCAGTTTTCATTTTACCGATACATCG

  18280 ATCAGAGAGTACGGAGCTTCCCACATTGCCTCCCGCACCCCATAGCAGGC
     92 ATCAGAGAGTACGGAGCTTCCCACATTGCCTCCCGCACCCCATAGCAGGC

  18330 AGAGGCACCAGCCAGGCGCCATCCTTTACAAGCTGTTTCAGCAAAATGTC
    142 AGAGGCACCAGCCAGGCGCCATCCTTTACAAGCTGTTTCAGCAAAATGTC

  18380 CCTGGTG...CAGGGTCTTAATGATTAAGTCCAGGCACTCCAAGGACGAT
            >>>...>>>
    192 CCTG         GGTCTTAATGATTAAGTCCAGGCACTCCAAGGACGAT

  19236 TAAATATTTAAACAGCAGCTTCCCTTTGAAGGTC...CAGGGAAGGAAAT
                                       >>>...>>>
    233 TAAATATTTAAACAGCAGCTTCCCTTTGAAG         GGAAGGAAAT

  19473 GACCGCAGAAGGGAGTGCACCGTGGGGGCAGGAGGTGCTGCGGCCAATGG
    274 GACCGCAGAAGGGAGTGCACCGTGGGGGCAGGAGGTGCTGCGGCCAATGG

  19523 GCAGTGCTGGGAGATGAAACCAACAGCGGCGATTGCCCCAGAAAGAACTT
    324 GCAGTGCTGGGAGATGAAACCAACAGCGGCGATTGCCCCAGAAAGAACTT

  19573 GGCGCTGCTAAGGCAATGGCGGCCCGAAAGAGGCCCCTGA
    374 GGCGCTGCTAAGGCAATGGCGGCCCGAAAGAGGCCCCTGA
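
The exon summary at the top of the sim4 output is enough to derive exon and intron coordinates on the genome. The sketch below parses summary lines of the form `genome_start-genome_end  (cdna_start-cdna_end)  identity% ->`, which is assumed here to be the layout of the output shown above, and infers each intron as the gap between consecutive exons. The function and key names are illustrative.

```python
import re

# One exon line of the sim4 summary: "16806-16850  (1-45)   97% ->"
SIM4_EXON = re.compile(r"(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%")


def parse_sim4_exons(text):
    """Extract exon records from the summary lines of a sim4 report."""
    exons = []
    for m in SIM4_EXON.finditer(text):
        g_start, g_end, c_start, c_end, identity = map(int, m.groups())
        exons.append({
            "genome_start": g_start, "genome_end": g_end,
            "cdna_start": c_start, "cdna_end": c_end,
            "identity": identity,
        })
    return exons


def introns_from_exons(exons):
    """Introns are the genomic gaps between consecutive exons."""
    return [(left["genome_end"] + 1, right["genome_start"] - 1)
            for left, right in zip(exons, exons[1:])]


SUMMARY = """
16806-16850  (1-45)   97% ->
18234-18383  (46-195)   100% ->
19199-19266  (196-263)   100% ->
19463-19612  (264-413)   100%
"""

if __name__ == "__main__":
    exons = parse_sim4_exons(SUMMARY)
    for start, end in introns_from_exons(exons):
        print(f"intron {start}-{end} ({end - start + 1} nt)")
```

For the alignment shown, this gives four exons and three introns, the first intron spanning 16851-18233 on the genomic contig.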
38NCAM2
[Figure labels: Hs.177691, Hs.135892, DKFZp761I1311, Hs.76118]
39COL18A1
[Figure labels: Hs.78409, 3' UTR overlap, Hs.84190, SLC19A1]
40Alternative polyadenylation
- Affects at least half of all genes
  - Estimate based on manual examination of a 3.5 Mb contig from chr 21
- Exists at both micro (< 100 bp) and macro (> 100 bp) levels
- Often extends over distances of several kilobases
- Impacts on microarray probe design as well as SAGE data interpretation
41Reconstituting transcripts (1)
- Data cleanup
  - Decide which transcript-to-genome matches are reliable (gene families, pseudogenes, genome repeats, etc.)
  - Within alignments, decide which parts are bona fide exons
- Exon generation
  - Determine directionality, based on splice junctions and RNA annotations
  - Determine exon boundaries and completeness
42Reconstituting transcripts (2)
- Merge multiple occurrences of the same exon, respecting boundary rules
  - E.g. overlapping exons with different splicing patterns are not merged
- Merge multiple occurrences of the same splice junctions
  - Distinguish differences due to noise in the data from real alternative splice junctions
43[Figure with panels A, B and C. Labels: 3' tags, Alu, elements numbered 1 to 7, and positions 3,592; 3,674; 5,734; 5,779; 6,122; 6,152; 6,267; 6,355; 6,577; 6,749; 8,162; 8,505; 9,995]
44Genes as graphs
B0 5600 5/1 > D0 5/1
D0 6769 5/1 B0 5/1 > -> A0 5/1
A0 8755 5/1 D0 5/1 -> > D1 5/1
D1 9013 5/1 A0 5/1 > -> A1 5/1
B1 10515 2/1 > D2 2/1
A1 10929 5/1 D1 5/1 -> > D2 5/1
D2 11092 7/2 B1 2/1 > A1 5/1 > -> A2 7/2
A2 11673 10/1 D2 10/1 -> > E0 10/1
E0 12486 10/1 3P008703 A2 10/1 >
- Nodes are connected by edges
- Edges are exons or introns
- Nodes are:
  - B: initiation
  - D: splice donor
  - A: splice acceptor
  - E: polyadenylation site
[Figure: graph drawn with nodes B0, D0, A0, D1, B1, A1, D2, A2, E0]
45Reconstituting transcripts (3)
- Generate graphs for each gene in a genomic segment
  - A gene is a connected graph: every node is linked, directly or indirectly, to the others
  - Graphs contain nodes and edges linking these nodes (exons and introns)
- Generate the most likely transcripts for each gene
  - A transcript is a path through the graph
  - The decision on which paths to keep or reject is based on the frequencies with which edges are documented
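
The "transcript as a path" idea can be illustrated with the nodes listed on the "Genes as graphs" slide. In the sketch below the edge list is transcribed from that listing, and the first number of each `n/m` annotation is taken as the edge's support; that reading of the annotation, the threshold and all function names are assumptions made for illustration only.

```python
from collections import defaultdict

# (source, target, edge type, support); support = first number of "n/m" (assumed).
EDGES = [
    ("B0", "D0", "exon", 5),
    ("D0", "A0", "intron", 5),
    ("A0", "D1", "exon", 5),
    ("D1", "A1", "intron", 5),
    ("B1", "D2", "exon", 2),
    ("A1", "D2", "exon", 5),
    ("D2", "A2", "intron", 7),
    ("A2", "E0", "exon", 10),
]


def build_graph(edges):
    """Adjacency list: node -> list of (target, edge type, support)."""
    graph = defaultdict(list)
    for src, dst, kind, support in edges:
        graph[src].append((dst, kind, support))
    return graph


def enumerate_transcripts(graph, min_support=1):
    """List paths from initiation (B) nodes to polyadenylation (E) nodes.

    Each result is (path, weakest edge support along the path); edges with
    support below min_support are not followed.
    """
    transcripts = []

    def walk(node, path, weakest):
        if node.startswith("E"):
            transcripts.append((path, weakest))
            return
        for dst, _kind, support in graph.get(node, []):
            if support >= min_support:
                walk(dst, path + [dst], min(weakest, support))

    for start in sorted(n for n in graph if n.startswith("B")):
        walk(start, [start], float("inf"))
    return transcripts


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for path, weakest in enumerate_transcripts(graph):
        print(" -> ".join(path), f"(weakest edge support: {weakest})")
```

With these edges there are two candidate transcripts, one starting at B0 and one at the alternative initiation node B1; raising `min_support` to 3 keeps only the first, which mirrors the idea of rejecting paths that rest on poorly documented edges.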
46Towards a gene index?
- The graph nodes are sequence based and can easily be made non-redundant
- The connectivity between nodes can be completed as more data become available
- The gene index derived from graphs includes a full representation of the locus, including all splice variants and all of the primary transcript data (ESTs, full-length)
47What are the uses of the graphs?
- Solid experimental basis for designing microarray probes (3' ends)
- Mapping of SAGE and MPSS tags
- Exploration of the diversity of protein coding potential (alternative splicing)
- Gene discovery based on expression patterns, either for whole locus or exon by exon
- Exploration of the complexity of genes