Title: EST sequences and databases: Exploring the transcriptome
1. EST sequences and databases
- Exploring the transcriptome
2. Why EST sequencing?
- Systematic sampling of the transcribed portion of the genome (transcriptome)
- Provides sequence tags allowing unique identification of genes (e.g. for SAGE)
- Provides experimental evidence for the positions of exons
- Provides regions coding for potentially new proteins
- Provides clones for DNA microarrays
3. ESTs - the basics
- cDNA libraries prepared from various tissues and cell lines, using directional cloning
- Gridding of individual clones using robots
- For each clone, sequencing of both ends of the insert in a single pass
- Deposit readable part of sequence in database
4. cDNA libraries
- Most are native, meaning that clone frequency reflects mRNA abundance
- Most are primed with oligo(dT), meaning that 3' ends are heavily represented
- The complexity of libraries is extremely variable
- Normalized libraries are used to enrich for rare mRNAs
5. cDNA libraries used
- Currently 2225 libraries represented
- Most libraries managed by the IMAGE consortium
- Human (over 300) and mouse (75) libraries most abundantly represented at IMAGE
- Systematic effort to make libraries from cancerous tissue: CGAP project (NCI)
- Many tissues still not sampled
- Quality very uneven
6. Clone availability
- In principle, all clones produced by IMAGE are publicly available
- Distributors: ATCC, Incyte and Research Genetics in the US; HGMP (UK) and RZPD (Germany) in Europe
- Error rate is high: about 30% chance that a clone doesn't have the expected sequence
- Research Genetics sells sets of sequence-verified clones
7. EST databases
- EMBL/GenBank have separate sections for EST sequences
- ESTs are the most abundant entries in the databases (>60%)
- ESTs are not separated by species in the databases
- EST sequences are submitted in bulk, but do have to meet minimal quality criteria
8. EST entries in EMBL (1)
ID   AI242177 standard; RNA; EST; 581 BP.
AC   AI242177;
SV   AI242177.1
DT   05-NOV-1998 (Rel. 57, Created)
DT   03-MAR-2000 (Rel. 63, Last updated, Version 3)
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3' similar to gb:M10988 TUMOR NECROSIS FACTOR
DE   PRECURSOR (HUMAN), mRNA sequence.
RN   [1]
RP   1-581
RA   NCI-CGAP;
RT   "National Cancer Institute, Cancer Genome Anatomy Project (CGAP), Tumor
RT   Gene Index http://www.ncbi.nlm.nih.gov/ncicgap";
RL   Unpublished.
DR   RZPD; IMAGp998P154529; IMAGp998P154529.
CC   On May 19, 1998 this sequence version replaced gi:2846208.
CC   Contact: Robert Strausberg, Ph.D.
CC   Tel: (301) 496-1550
CC   Email: Robert_Strausberg_at_nih.gov
CC   This clone is available royalty-free through LLNL; contact the
CC   IMAGE Consortium (info_at_image.llnl.gov) for further information.
CC   Insert Length: 1280 Std Error: 0.00
CC   Seq primer: -40UP from Gibco
CC   High quality sequence stop: 463.
9. EST entries in EMBL (2)
FH   Key             Location/Qualifiers
FH
FT   source          1..581
FT                   /db_xref="taxon:9606"
FT                   /db_xref="ESTLIB:452"
FT                   /db_xref="RZPD:IMAGp998P154529"
FT                   /note="Organ: Liver and Spleen; Vector: pT7T3D
FT                   (Pharmacia) with a modified polylinker; Site_1: Pac I;
FT                   Site_2: Eco RI; This is a subtracted version of the
FT                   original Soares fetal liver spleen 1NFLS library. 1st
FT                   strand cDNA was primed with a Pac I - oligo(dT) primer
FT                   5' AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3',
FT                   double-stranded cDNA was ligated to Eco RI adaptors
FT                   (Pharmacia), digested with Pac I and cloned into the
FT                   Pac I and Eco RI sites of the modified pT7T3 vector.
FT                   Library went through one round of normalization.
FT                   Library constructed by Bento Soares and M.Fatima
FT                   Bonaldo."
FT                   /sex="male"
FT                   /organism="Homo sapiens"
FT                   /clone="IMAGE:1851134"
FT                   /clone_lib="Soares_fetal_liver_spleen_1NFLS_S1"
FT                   /dev_stage="20 week-post conception fetus"
FT                   /lab_host="DH10B (ampicillin resistant)"
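Each line of an EMBL entry starts with a two-letter line code (ID, AC, DE, FT, ...), which makes the entry easy to read programmatically. The following is a minimal sketch only: the function name `parse_embl_lines` is hypothetical, just a few header lines of the entry above are reproduced, and their punctuation follows the usual EMBL flat-file layout, which is an assumption wherever the slide text has lost it.

```python
def parse_embl_lines(text):
    """Group the lines of an EMBL flat-file entry by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        # Skip blank lines, the '//' terminator and codes without content
        if code.strip() and code != "//" and value:
            fields.setdefault(code, []).append(value)
    return fields


# A few header lines of the entry shown on the slide (punctuation assumed)
record = """\
ID   AI242177 standard; RNA; EST; 581 BP.
AC   AI242177;
DT   05-NOV-1998 (Rel. 57, Created)
DT   03-MAR-2000 (Rel. 63, Last updated, Version 3)
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3' similar to gb:M10988 TUMOR NECROSIS FACTOR
DE   PRECURSOR (HUMAN), mRNA sequence.
"""

fields = parse_embl_lines(record)
print(fields["ID"][0].split()[0])   # entry name
print(" ".join(fields["DE"]))       # description, continuation lines joined
```

Lines that repeat a code (DT, DE, CC, FT) are kept as a list, so continuation lines can be joined or inspected one by one.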
10. EST sequence quality
- Single, unverified runs, so quality is on average low
- Current submission rules require a documented Phred score > 20 (<1% error)
- Trivial contaminants are common (vector, rRNA, mitochondrial RNA, other species...)
- Frameshift errors are very common
- For many reads, trace files can be retrieved
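The link between a Phred score and an error rate follows from the standard definition Q = -10 log10(p), where p is the probability that a base call is wrong. A small sketch of the conversion (the function names are illustrative, not taken from any particular package):

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality corresponding to a per-base error probability p."""
    return -10 * math.log10(p)


def expected_errors(read_length, q):
    """Expected number of wrong bases in a read of uniform quality q."""
    return read_length * phred_to_error_probability(q)


for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))
```

A score of 20 corresponds to one expected error per 100 bases, so a read of several hundred bases at that quality would still be expected to contain a few errors.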
11. Private EST databases
- Producing and selling access to EST data has proven to be a lucrative business
- Incyte has created the LifeSeq databases, to which access can be bought
- Human Genome Sciences has preferred to exploit the data itself, and to get patents on promising genes found in its databases
12. Why search EST databases?
- Alternative to library screening: a short tag can lead to a cDNA clone
- Alternative to cDNA sequencing: sequences of multiple ESTs can reconstitute a full-length cDNA
- Gene discovery: EST sequences often contain new members of known families
13. BLAST searching EST databases
- EST databases can be searched with BLAST at many sites, including our own
- As the EST sections of EMBL/GenBank are not separated by species, one should choose the server carefully (e.g. NCBI offers human, mouse and others; EBI has no species choice)
- Searching large EST collections with a protein query (tblastn) can take very long...
14. Problems with raw EST databases
- The databases are highly redundant: e.g. 1.7 x 10^6 human sequences for 10^5 genes
- The databases are skewed toward sequences near the 3' end of mRNAs
- The error rates are high in individual ESTs
- For most ESTs, there is no indication as to the gene from which it was derived
15. The ORESTES project
- Goal: to obtain EST sequences from the under-represented, often coding, central portions of mRNAs
- Methodology: use low-stringency semi-random priming followed by PCR, producing low-complexity libraries
- Results: over 250000 ESTs produced, of which half produce novel information
16. How to organize EST collections?
- Clustering: associate individual EST sequences with unique transcripts or genes
- Assembling: derive consensus sequences from overlapping ESTs belonging to the same cluster
- Mapping: associate ESTs (or EST contigs) with exons in genomic sequences
- Interpreting: find and correct coding regions
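The clustering step can be illustrated with a deliberately simple sketch: two sequences are linked when they share at least one exact k-mer, and clusters are the connected groups (single linkage, tracked with a union-find structure). This is not how Unigene or the other resources named in these slides work; real pipelines use alignments, handle both strands and mask repeats. The function names and the parameters `k` and `min_shared` are hypothetical.

```python
def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=12, min_shared=1):
    """Single-linkage clustering: link sequences sharing >= min_shared k-mers."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(s, k) for s in seqs]
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
                parent[find(i)] = find(j)   # merge the two clusters

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy reads: the second overlaps the end of the first, the third is unrelated
reads = [
    "ACGTACGTTAGCCGATAGGCTA",
    "GCCGATAGGCTATTGACCAGT",
    "GGGGCCCCAAAATTTTGGCCAATT",
]
print(cluster_by_shared_kmers(reads, k=8))
```

The all-against-all comparison is quadratic in the number of sequences, which is acceptable for a toy example but not for millions of ESTs; an index from k-mer to sequence would be the usual next step.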
17. The Unigene database
- Unigene is an ongoing effort at NCBI to cluster EST sequences with traditional gene sequences
- For each cluster, a lot of additional information is included
- Unigene is regularly rebuilt. Therefore, cluster identifiers are not stable gene indices
18. The TIGR Gene Indices
- The Gene Indices are based on contig assemblies rather than true clustering strategies
- There are more Gene Indices than Unigene clusters
- TIGR maintains EGAD, a database of known human transcripts
- EGAD entries are included in the Human Gene Indices
19. Other EST cluster and assembly databases
- SANBI in South Africa produces the STACK collection of human EST contigs
- MIPS in Munich and the SIB produce BLAST-searchable contigs from Unigene
- TIGEM in Italy has a nice collection of EST search and assembly tools, local and remote
- The CBIL at the U. of Pennsylvania has assembled the DOTS database
20. Strategy for gene discovery
- Cluster EST sequences to identify candidate transcripts
- Assemble to increase length and reduce redundancy
- Find coding regions and reading frame, and correct frameshift errors
- Use deduced protein sequences as searchable database (TrEST)
21. ESTScan
- A program to recognize and correct coding regions
in EST sequences
22. Design goals
- Discrimination between coding and non-coding sequences
- Detection of beginning and end of coding regions
- Correction of frameshift errors
- Tolerance to artefactual stop codons
23. Methodology
- Detection of coding regions: use known bias in hexanucleotide frequencies
- Implemented as a fifth-order, inhomogeneous Markov model (taken from C. Burge and S. Karlin)
- Modify the model to accommodate insertions and deletions
24. Markov models
- Markov model: the probability of occurrence of a symbol at one position depends on the values of preceding positions
- Hidden Markov model: the observed sequence is produced by a Markov chain of hidden states, each of which emits symbols with its own probabilities
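The first definition can be made concrete with a small sketch of a homogeneous Markov model of order k: the probability of each base is estimated from how often it followed the preceding k bases in a training set. The function names and the add-one smoothing are choices made for this illustration and are not taken from ESTScan.

```python
import math
from collections import defaultdict


def train_markov(seqs, order):
    """Count how often each base follows each context of `order` bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts


def log_probability(seq, counts, order, alphabet="ACGT"):
    """Log-probability of seq (after its first `order` bases) under the model."""
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        seen = sum(context.values())
        # Add-one smoothing so unseen transitions do not give log(0)
        total += math.log((context.get(seq[i], 0) + 1) / (seen + len(alphabet)))
    return total


model = train_markov(["ACACACACAC"], order=1)
print(log_probability("ACAC", model, order=1))   # follows the trained pattern
print(log_probability("AAAA", model, order=1))   # does not, so scores lower
```

With `order=5` the context is a pentanucleotide and each transition corresponds to a hexanucleotide, which is where the hexanucleotide frequency bias enters; a model of that order needs far more training data than this toy example.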
25Types of Markov models
a) Three periodic fifth order Markov modelb)
Homogeneous fifth order Markov modelc) Hidden
Markov modelBurge Karlin, Curr. Opinion
Struct. Biol. 8346, 1998
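The difference between models a) and b) can be sketched in code: a three-periodic model keeps one transition table per codon position, while a homogeneous model uses a single table. Scoring a sequence as the log-odds of the two, once per possible frame, gives a simple coding-potential and frame detector. This is an illustration under stated assumptions, not the ESTScan implementation: it has no hidden states and no handling of insertions or deletions, all names are hypothetical, and the toy data use order 2 instead of order 5.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_periodic(coding_seqs, order):
    """One transition table per codon position (0, 1, 2) of the emitted base.

    Training sequences are assumed to start at the first base of a codon.
    """
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for s in coding_seqs:
        for i in range(order, len(s)):
            tables[i % 3][s[i - order:i]][s[i]] += 1
    return tables


def train_homogeneous(seqs, order):
    """A single transition table, ignoring codon position."""
    table = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            table[s[i - order:i]][s[i]] += 1
    return table


def cond_prob(table, context, base):
    """Smoothed probability of `base` given the preceding `context`."""
    seen = table.get(context, {})
    return (seen.get(base, 0) + 1) / (sum(seen.values()) + len(ALPHABET))


def coding_log_odds(seq, periodic, background, order, frame):
    """Log-odds of coding vs background, assuming seq[frame] starts a codon."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        phase = (i - frame) % 3
        score += math.log(cond_prob(periodic[phase], context, base)
                          / cond_prob(background, context, base))
    return score


def best_frame(seq, periodic, background, order):
    """Frame (0, 1 or 2) with the highest coding log-odds."""
    return max(range(3),
               key=lambda f: coding_log_odds(seq, periodic, background, order, f))


# Toy training data: an artificial "coding" repeat and an arbitrary background
periodic = train_periodic(["GCTGAA" * 15], order=2)
background = train_homogeneous(["ACGTTGCAAGCTTAGGCATCGATCGGATCCTAGCTAGGCTAA"], order=2)
query = "GCTGAA" * 4
print(best_frame(query, periodic, background, order=2))
print(best_frame("A" + query, periodic, background, order=2))
```

Prepending a single base to the query shifts the best-scoring frame by one, which is the signal a frameshift-tolerant model can exploit when the frame changes in the middle of a read.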
26. The ESTScan HMM
27. Normalization
- Goal: establish score boundaries between coding and non-coding regions
- Problem: raw scores are dependent on both length and GC content
- Approach: establish an empirical system based on observed behavior of the algorithm
28. Normalization strategy
- Create pseudo-EST databases devoid of coding potential, with entries of variable length
- Split these on the basis of GC content
- Look at score distributions, in terms of the proportion of false positives above a given cutoff
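The strategy above can be sketched as follows: shuffle each sequence to destroy any coding signal, group the shuffled sequences by GC content, score them, and take as cutoff the score exceeded by at most a fraction f of each group. The GC class boundaries are those given on the next slide, with half-open intervals assumed; `score_fn` stands for any raw scoring function and is a placeholder, and binning by length is left out for brevity.

```python
import random
from collections import defaultdict


def gc_percent(seq):
    """Percentage of G and C bases in seq."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def isochore_class(seq):
    """GC class I-IV (boundaries 43, 51, 57 % GC; interval ends assumed)."""
    gc = gc_percent(seq)
    if gc < 43:
        return "I"
    if gc < 51:
        return "II"
    if gc < 57:
        return "III"
    return "IV"


def shuffled(seq, rng):
    """Random permutation of seq: same composition, no coding signal."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def cutoff_for_rate(null_scores, f):
    """Score exceeded by at most a fraction f of the null (non-coding) scores."""
    ranked = sorted(null_scores, reverse=True)
    return ranked[min(int(f * len(ranked)), len(ranked) - 1)]


def cutoff_table(seqs, score_fn, f, seed=0):
    """Cutoff score per GC class at false positive rate f."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for seq in seqs:
        null_seq = shuffled(seq, rng)
        by_class[isochore_class(null_seq)].append(score_fn(null_seq))
    return {cls: cutoff_for_rate(scores, f) for cls, scores in by_class.items()}


# With 100 null scores 0..99 and f = 0.25, exactly 25 scores exceed the cutoff
print(cutoff_for_rate(list(range(100)), 0.25))
```

Shuffling preserves base composition, so each shuffled sequence stays in the GC class of its source; a fuller version would also bin by sequence length, since raw scores depend on length as well.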
29. Behavior of scoring system
Cutoff values used for score normalization. Shuffled EST sequence databases with entries of uniform length were produced as described. The isochore classes were I: <43% GC; II: 43-51%; III: 51-57%; and IV: >57%. The scores for all entries in each database were calculated, and the cutoff scores for false positive rates of f were calculated. The actual table used by ESTScan contains a larger number of entries, based on more values for f and length.
30. Effects of normalization
- Normalized score = log(raw/cutoff) x 100
- Positive scores denote coding regions, negative scores non-coding regions
- The acceptable level of false positives can be chosen by the user
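The normalization formula is short enough to check directly. The slides do not state the base of the logarithm; the natural logarithm is assumed here because it reproduces the values in the example header of the ESTScan output slide (raw score 376, cutoff 253, normalized score 39.6).

```python
import math


def normalized_score(raw, cutoff):
    """Normalized score = ln(raw / cutoff) x 100 (natural log assumed)."""
    return 100 * math.log(raw / cutoff)


print(round(normalized_score(376, 253), 1))   # 39.6, as in the example header
print(normalized_score(253, 253))             # 0.0: exactly at the cutoff
```

A raw score above the cutoff gives a positive normalized score and one below it a negative score, which is what makes the sign usable as a coding/non-coding call.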
31. ESTScan output
The output header line contains the following information:
>xxyyyyyzzzzz Normalized_score Raw_score Cutoff Start_coding Stop_coding

>emblAI740923accAI740923 39.6 376 253 1 297 Homo sapiens wg18d05.x1 Soares_NSF_F8_9W_OT_PA_P_S1 Homo sapiens cDNA clone IMAGE:2365449 3' similar to TR:Q15597 Q15597 TRANSLATION INITIATION FACTOR EIF-4GAMMA
TAVIKQRVPILLKYLDSDTEKELQALYALQASIVKLDQPANLLRMFFDCLYDEEVISEDA
FYKWESSKDPAEQNGKGVALKSVTAFFTWLREAEEESED

ACTGCTGTTATCAAGCAGAGAGTGCCGATCTTACTCAAGTACCTAGACTCAGATACAGAG
AAGGAACTGCAAGCACTTTATGCACTACAAGCATCGATAGTAAAACTTGATCAACCTGCC
AATTTGCTGCGGATGTTTTTTGATTGTCTATATGACGAGGAGGTGATCTCCGAGGATGCC
TTCTACAAATGGGAGAGCAGCAAGGACCCTGCAGAGCAGAATGGGAAGGGCGTGGCTCTG
AAATCTGTCACGGCATTCTTCACGTGGCTGCGGGAAGCAGAAGAGGAGTCTGAGGATaac
taaaacttcaaatacccaaaatgaaacaaaagaaacaatttaagtatttttttaaaaaag
tttcacgtcttcgccaatcacagtgcagcaaggccaattctcgcagaaacccccacgtgt
gcacgagtgggagaggggaaagagaaaaaaaggtgatcatggaggaaaaaggtactggat
aaaagtaaacttcaaaccttagggcgggagcactaaaaccaaaaaaaaaaaa
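A header of this form can be read by splitting on whitespace: the first token is the identifier and the next five are the numeric fields in the order listed above. The sketch below also translates the start of the predicted coding region with the standard genetic code. The names `EstscanHeader`, `parse_header` and `translate` are hypothetical, and the identifier is used exactly as it appears in the slide text.

```python
from typing import NamedTuple


class EstscanHeader(NamedTuple):
    identifier: str
    normalized_score: float
    raw_score: float
    cutoff: float
    start_coding: int
    stop_coding: int
    description: str


def parse_header(line):
    """Split a header line into identifier, five numeric fields and description."""
    tokens = line.lstrip(">").split()
    ident, norm, raw, cutoff, start, stop = tokens[:6]
    return EstscanHeader(ident, float(norm), float(raw), float(cutoff),
                         int(start), int(stop), " ".join(tokens[6:]))


# Standard genetic code, bases ordered T, C, A, G at each codon position
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate complete codons of dna; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


header = parse_header(">emblAI740923accAI740923 39.6 376 253 1 297 Homo sapiens wg18d05.x1")
print(header.normalized_score, header.start_coding, header.stop_coding)
print(translate("ACTGCTGTTATCAAGCAG"))   # first 18 bases of the example: TAVIKQ
```

In the example, positions 1 to 297 span 99 codons, which matches the 99 residues of the protein sequence shown and the switch from upper to lower case in the nucleotide sequence.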
32. Test data for CDS recognition
- Coding regions: isochore-separated sets from GENIO
- 3' UTR: annotated regions from human genes, extracted by SRS
- ESTs: identified by BLAST (5 best scores), and trimmed by xblast
33. CDS recognition tests
- For each sequence in test set, calculate normalized score on both strands and report best of two
- Plot data as regular or cumulative frequency histograms
- Estimate discriminant power from overlap between test sets
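The test procedure can be sketched in a few lines: score the sequence and its reverse complement, keep the better of the two, and summarize how well a threshold separates a coding set from a non-coding set. `score_fn` is a placeholder for the normalized scoring function, and summarizing the overlap as a sensitivity and a false positive rate at one threshold is a simplification chosen for this example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (case preserved)."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_strand_score(seq, score_fn):
    """Score both strands and return (strand, score) for the better one."""
    forward = score_fn(seq)
    reverse = score_fn(reverse_complement(seq))
    return ("+", forward) if forward >= reverse else ("-", reverse)


def rates_at_threshold(coding_scores, noncoding_scores, threshold=0.0):
    """Fraction of coding and of non-coding sequences scoring above threshold."""
    sensitivity = sum(s > threshold for s in coding_scores) / len(coding_scores)
    false_positive_rate = (sum(s > threshold for s in noncoding_scores)
                           / len(noncoding_scores))
    return sensitivity, false_positive_rate


# Toy scoring function: number of ATG triplets (a stand-in for a real score)
print(best_strand_score("CAT", lambda s: s.count("ATG")))
print(rates_at_threshold([5, 10, -1], [-3, -8, 2]))
```

The less the two score distributions overlap, the closer the first value is to 1 and the second to 0 at a well-chosen threshold.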
34. CDS recognition - results
[Figure panels: CDS, EST]
35. CDS recognition - improvements
- Introduce explicit models for non-coding regions, especially 3' UTRs
- Introduce a translation initiation recognition module
- Re-optimize the HMM's CDS scoring parameters on EST data
36. Frame recognition and correction
- Test set: CDS extracted from all human non-Unigene ESTs by ESTScan
- Compare results of blastp against this set and tblastn against the same EST collection (32-fold larger)
- Result: equivalent numbers of sequences lost by ESTScan or added to the match list because of frame corrections
37. Frame recognition and correction
- Informal testing: took 2 sequences known to contain frameshift errors
- First contains a frameshift near the 3' end that does not affect length of CDS - corrected by ESTScan
- Second is an EST with multiple sequencing errors, including two frameshifts
38. Uses of ESTScan
- Flag ESTs that contain CDS for further analysis (e.g. in ORESTES)
- Create virtual protein sequence databases from ESTs or EST contigs
- Detect and correct reading frame during cDNA sequencing (e.g. in SEREX project)
39. ESTScan - the future
- Include non-CDS and translation initiation recognition
- Coding exon recognition in low-quality genomic sequences
- Produce parameter and normalization tables for more species (e.g. Drosophila)
- Integrate with contig assembly
40. Using EST data for gene expression analysis
- Basic idea: compare relative abundance of specific cDNAs in libraries of different origins
- Best tools available: Digital Differential Display (DDD) and xProfiler at the CGAP Web site
- Alternative: browse through genes unique to different libraries
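The basic idea can be illustrated by counting, for one gene, its ESTs in two libraries and asking whether the difference in relative abundance could be due to sampling alone. The slides do not say which statistic the CGAP tools apply; a two-sided Fisher exact test on the 2x2 table of counts is one standard choice and is used here purely as an illustration. The function names and the example counts are made up.

```python
from math import comb


def relative_abundance(gene_ests, total_ests):
    """Fraction of a library's ESTs that belong to the gene."""
    return gene_ests / total_ests


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the table [[a, b], [c, d]].

    a: ESTs of the gene in library A    b: all other ESTs in library A
    c: ESTs of the gene in library B    d: all other ESTs in library B
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x gene ESTs falling in library A
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Made-up counts: 12 of 5000 ESTs in library A vs 2 of 6000 in library B
print(relative_abundance(12, 5000), relative_abundance(2, 6000))
print(fisher_exact_two_sided(12, 5000 - 12, 2, 6000 - 2))
```

Because most genes are represented by only a handful of ESTs per library, an exact test is preferable here to approximations that assume large counts.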