EST sequences and databases: Exploring the transcriptome

Provided by: chEmbnetO

1
EST sequences and databases

Exploring the transcriptome

2
Why EST sequencing?

Systematic sampling of the transcribed portion of
the genome (transcriptome)
Provides sequence tags allowing unique
identification of genes (e.g. for SAGE)
Provides experimental evidence for the positions
of exons
Provides regions coding for potentially new
proteins
Provides clones for DNA microarrays

3
ESTs - the basics

cDNA libraries prepared from various tissues and
cell lines, using directional cloning
Gridding of individual clones using robots
For each clone, sequencing of both ends of insert
in single pass
Deposit readable part of sequence in database

4
cDNA libraries

Most are native, meaning that clone frequency
reflects mRNA abundance
Most are primed with oligo(dT), meaning that 3'
ends are heavily represented
The complexity of libraries is extremely variable
Normalized libraries are used to enrich for
rare mRNAs

5
cDNA libraries used

Currently 2225 libraries represented
Most libraries managed by the IMAGE consortium
Human (over 300) and mouse (75) libraries most
abundantly represented at IMAGE
Systematic effort to make libraries from
cancerous tissue: the CGAP project (NCI)
Many tissues still not sampled
Quality very uneven

6
Clone availability

In principle, all clones produced by IMAGE are
publicly available
Distributors: ATCC, Incyte and Research Genetics
in the US; HGMP (UK) and RZPD (Germany) in Europe
Error rate is high: about a 30% chance that a clone
doesn't have the expected sequence
Research Genetics sells sets of sequence-verified
clones

7
EST databases

EMBL/GenBank have separate sections for EST
sequences
ESTs are the most abundant entries in the
databases (>60%)
ESTs are not separated by species in the
databases
EST sequences are submitted in bulk, but do have
to meet minimal quality criteria

8
EST entries in EMBL (1)
ID   AI242177   standard; RNA; EST; 581 BP.
AC   AI242177;
SV   AI242177.1
DT   05-NOV-1998 (Rel. 57, Created)
DT   03-MAR-2000 (Rel. 63, Last updated, Version 3)
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3' similar to gb:M10988 TUMOR NECROSIS FACTOR
DE   PRECURSOR (HUMAN), mRNA sequence.
RN   [1]
RP   1-581
RA   NCI-CGAP;
RT   "National Cancer Institute, Cancer Genome Anatomy Project (CGAP), Tumor
RT   Gene Index http://www.ncbi.nlm.nih.gov/ncicgap";
RL   Unpublished.
DR   RZPD; IMAGp998P154529; IMAGp998P154529.
CC   On May 19, 1998 this sequence version replaced gi:2846208.
CC   Contact: Robert Strausberg, Ph.D.
CC   Tel: (301) 496-1550
CC   Email: Robert_Strausberg@nih.gov
CC   This clone is available royalty-free through LLNL; contact the
CC   IMAGE Consortium (info@image.llnl.gov) for further information.
CC   Insert Length: 1280 Std Error: 0.00
CC   Seq primer: -40UP from Gibco
CC   High quality sequence stop: 463.
9
EST entries in EMBL (2)
FH   Key             Location/Qualifiers
FH
FT   source          1..581
FT                   /db_xref="taxon:9606"
FT                   /db_xref="ESTLIB:452"
FT                   /db_xref="RZPD:IMAGp998P154529"
FT                   /note="Organ: Liver and Spleen; Vector: pT7T3D (Pharmacia)
FT                   with a modified polylinker; Site_1: Pac I; Site_2: Eco RI;
FT                   This is a subtracted version of the original Soares fetal
FT                   liver spleen 1NFLS library. 1st strand cDNA was primed
FT                   with a Pac I - oligo(dT) primer 5'
FT                   AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3',
FT                   double-stranded cDNA was ligated to Eco RI adaptors
FT                   (Pharmacia), digested with Pac I and cloned into the Pac I
FT                   and Eco RI sites of the modified pT7T3 vector. Library
FT                   went through one round of normalization. Library
FT                   constructed by Bento Soares and M.Fatima Bonaldo."
FT                   /sex="male"
FT                   /organism="Homo sapiens"
FT                   /clone="IMAGE:1851134"
FT                   /clone_lib="Soares_fetal_liver_spleen_1NFLS_S1"
FT                   /dev_stage="20 week-post conception fetus"
FT                   /lab_host="DH10B (ampicillin resistant)"
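
Entries like the one above are organised by two-letter line codes (ID, AC, DE, FT, ...). As a rough sketch only, the Python below groups the lines of an entry by that code. The sample text is an abridged fragment of the AI242177 entry retyped with the usual flat-file punctuation assumed, and the name group_by_line_code is invented for this example; real work would normally use an established parser.

```python
from collections import defaultdict

# Abridged fragment of the AI242177 entry, retyped for illustration.
SAMPLE_ENTRY = """\
ID   AI242177   standard; RNA; EST; 581 BP.
AC   AI242177;
DT   05-NOV-1998 (Rel. 57, Created)
DT   03-MAR-2000 (Rel. 63, Last updated, Version 3)
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3'
CC   High quality sequence stop: 463.
"""


def group_by_line_code(entry_text):
    """Map each two-letter line code (ID, AC, DE, ...) to its list of lines."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if len(line) < 2:
            continue  # skip blank lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


fields = group_by_line_code(SAMPLE_ENTRY)
accession = fields["AC"][0].rstrip(";")
description = " ".join(fields["DE"])  # DE lines continue one another
print(accession)
print(description)
```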
10
EST sequence quality

Single, unverified runs, so quality is on average
low
Current submission rules require a documented Phred
score > 20 (<1% error)
Trivial contaminants are common (vector, rRNA,
mitochondrial RNA, other species)
Frameshift errors are very common
For many reads, trace files can be retrieved

11
Private EST databases

Producing and selling access to EST data has
proven to be a lucrative business
Incyte has created the LifeSeq databases, to
which access can be bought
Human Genome Sciences has preferred to exploit
the data itself, and to get patents on promising
genes found in its databases

12
Why search EST databases?

Alternative to library screening: a short tag can
lead to a cDNA clone
Alternative to cDNA sequencing: sequences of
multiple ESTs can reconstitute a full-length cDNA
Gene discovery: EST sequences often contain new
members of known families

13
BLAST searching EST databases

Many sites where EST databases can be searched
with BLAST, including our own
As EST sections of EMBL/GenBank are not separated
by species, one should choose the server carefully
(e.g. NCBI has human, mouse, others; EBI has no
species choice)
Searching large EST collections with a protein
query (tblastn) can take very long...

14
Problems with raw EST databases

The databases are highly redundant, e.g. 1.7 x 10^6
human sequences for 10^5 genes
The databases are skewed for sequences near the 3'
end of mRNAs
The error rates are high in individual ESTs
For most ESTs, there is no indication as to the
gene from which it was derived

15
The ORESTES project

Goal: to obtain EST sequences from the
under-represented, often coding, central portions
of mRNAs
Methodology: use low-stringency semi-random
priming followed by PCR, producing low complexity
libraries
Results: over 250000 ESTs produced, of which
half produce novel information

16
How to organize EST collections?

Clustering: associate individual EST sequences
with unique transcripts or genes
Assembling: derive consensus sequences from
overlapping ESTs belonging to the same cluster
Mapping: associate ESTs (or EST contigs) with
exons in genomic sequences
Interpreting: find and correct coding regions
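
The clustering step can be illustrated with a toy example. The sketch below is not the procedure used by any of the databases discussed here: it simply puts two reads in the same cluster when they share at least one exact k-mer (single linkage via union-find), and ignores reverse strands and sequencing errors. The read names, sequences and the choice k=8 are made up for the illustration.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=8):
    """Single-linkage clustering: reads sharing any k-mer end up together."""
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    index = {name: kmers(seq, k) for name, seq in seqs.items()}
    for a, b in combinations(seqs, 2):
        if index[a] & index[b]:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


toy_reads = {
    "est1": "ACTGCTGTTATCAAGCAGAGAGTGCC",
    "est2": "ATCAAGCAGAGAGTGCCGATCTTACTC",   # overlaps the end of est1
    "est3": "GGGCGGGAGCACTAAAACCAAAA",       # unrelated
}
print(cluster_by_shared_kmers(toy_reads))
```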

17
The Unigene database

Unigene is an ongoing effort at NCBI to cluster
EST sequences with traditional gene sequences
For each cluster, there is a lot of additional
information included
Unigene is regularly rebuilt. Therefore, cluster
identifiers are not stable gene indices

18
The TIGR Gene Indices

The Gene Indices are based on contig assemblies
rather than true clustering strategies
There are more Gene Indices than Unigene clusters
TIGR maintains EGAD, a database of known human
transcripts
EGAD entries are included in the Human Gene
Indices

19
Other EST cluster and assembly databases

SANBI in South Africa produces the STACK
collection of human EST contigs
MIPS in Munich and the SIB produce
BLAST-searchable contigs from Unigene
TIGEM in Italy has a nice collection of EST
search and assembly tools, local and remote
The CBIL at the U. of Pennsylvania has assembled
the DOTS database

20
Strategy for gene discovery

Cluster EST sequences to identify candidate
transcripts
Assemble to increase length and reduce redundancy
Find coding regions and reading frame, and
correct frameshift errors
Use deduced protein sequences as searchable
database (TrEST)

21
ESTScan

A program to recognize and correct coding regions
in EST sequences

22
Design goals

Discrimination between coding and non-coding
sequences
Detection of beginning and end of coding regions
Correction of frameshift errors
Tolerance to artefactual stop codons

23
Methodology

Detection of coding regions: use known bias in
hexanucleotide frequencies
Implemented as a fifth-order, inhomogeneous
Markov model (taken from C. Burge and S. Karlin)
Modify the model to accommodate insertions and
deletions

24
Markov models

Markov model: probability of occurrence of a
symbol at one position depends on values of
preceding positions
Hidden Markov model: the observed sequence is
produced by a Markov chain of hidden states, each
of which emits symbols with different probabilities
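
To make the idea concrete, here is a toy order-k Markov chain with one transition table per phase; with period 3 each codon position gets its own table, and with period 1 the model is homogeneous. This is only a sketch under simplifying assumptions (tiny made-up training strings, order 2 instead of 5, add-one pseudocounts, no insertions or deletions); it is not the ESTScan implementation, and the class and function names are invented here.

```python
import math
from collections import defaultdict


class PeriodicMarkovModel:
    """Order-k Markov chain with a separate transition table per phase."""

    def __init__(self, order=5, period=3, pseudocount=1.0):
        self.order, self.period, self.pseudocount = order, period, pseudocount
        # counts[phase][context][next_base]
        self.counts = [defaultdict(lambda: defaultdict(float))
                       for _ in range(period)]

    def train(self, sequences):
        """Training sequences are assumed to start at phase 0."""
        for seq in sequences:
            for i in range(self.order, len(seq)):
                context = seq[i - self.order:i]
                self.counts[i % self.period][context][seq[i]] += 1

    def log_prob(self, seq, phase=0):
        """Log-probability of seq, with seq[0] sitting at the given phase."""
        total = 0.0
        for i in range(self.order, len(seq)):
            table = self.counts[(i + phase) % self.period][seq[i - self.order:i]]
            num = table.get(seq[i], 0.0) + self.pseudocount
            den = sum(table.values()) + 4 * self.pseudocount
            total += math.log(num / den)
        return total


# Toy "coding" and "background" models (order 2 keeps the example small).
coding = PeriodicMarkovModel(order=2, period=3)
coding.train(["ATGGCTGCTGCTGCTGCTGCTGCT"])
background = PeriodicMarkovModel(order=2, period=1)
background.train(["ATATATATATATATATATATATAT"])


def log_odds(seq):
    """Positive when the coding model explains seq better than background."""
    return coding.log_prob(seq) - background.log_prob(seq)


print(log_odds("GCTGCTGCTGCT"), log_odds("TATATATATATA"))
```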

25
Types of Markov models
a) Three-periodic fifth-order Markov model
b) Homogeneous fifth-order Markov model
c) Hidden Markov model
Burge & Karlin, Curr. Opin. Struct. Biol. 8:346, 1998
26
The ESTScan HMM
27
Normalization

Goal: establish score boundaries between coding
and non-coding regions
Problem: raw scores are dependent on both length
and GC content
Approach: establish an empirical system based on
observed behavior of the algorithm

28
Normalization strategy

Create pseudo-EST databases devoid of coding
potential, with entries of variable length
Split these on the basis of GC content
Look at score distributions, in terms of the
proportion of false positives above a given
cutoff

29
Behavior of scoring system
Cutoff values used for score normalization.
Shuffled EST sequence databases with entries of uniform
length were produced as described. The isochore
classes were I: <43% GC; II: 43-51%; III:
51-57%; and IV: >57%. The scores for all entries
in each database were calculated, and the cutoff
score for each false positive rate f was determined.
The actual table used by ESTScan contains a
larger number of entries, based on more values
for f and length.
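
The empirical procedure can be sketched as follows. The isochore limits are those quoted in the caption; everything else is an assumption made for illustration: the function names, the handling of values that fall exactly on a class limit, and the use of a simple base shuffle to remove coding potential. The scoring function itself is not shown, so the cutoff routine takes a list of already computed null scores.

```python
import random

ISOCHORE_BOUNDS = (43.0, 51.0, 57.0)  # % GC limits between classes I-IV


def gc_percent(seq):
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def isochore_class(seq):
    """Return 1..4 for isochore classes I..IV (a limit value goes to the upper class)."""
    gc = gc_percent(seq)
    return 1 + sum(gc >= bound for bound in ISOCHORE_BOUNDS)


def shuffled(seq, rng):
    """Shuffle the bases: composition is kept, any coding signal is destroyed."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def cutoff_for_false_positive_rate(null_scores, f):
    """Score such that at most a fraction f of the null scores lie above it."""
    ordered = sorted(null_scores, reverse=True)
    allowed = int(f * len(ordered) + 1e-9)      # null entries allowed above cutoff
    return ordered[min(allowed, len(ordered) - 1)]


rng = random.Random(0)
pseudo_est = shuffled("ATGGCTGCTGCTAAATTTGGGCCC", rng)
print(isochore_class(pseudo_est))
print(cutoff_for_false_positive_rate(list(range(100)), 0.05))
```

In practice one table of cutoffs would be built per isochore class and per entry length, since raw scores depend on both.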
30
Effects of normalization

Normalized score = log(raw/cutoff) x 100
Positive scores denote coding regions, negative
scores non-coding
The acceptable level of false positives can be
chosen by the user
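
As a quick check of the formula, the example values shown on the ESTScan output slide (raw score 376, cutoff 253, normalized score 39.6) are reproduced when the logarithm is taken as the natural logarithm. That choice is inferred from those numbers rather than stated in the slides.

```python
import math


def normalized_score(raw, cutoff):
    """100 x ln(raw / cutoff): positive above the cutoff, negative below it."""
    return 100.0 * math.log(raw / cutoff)


print(round(normalized_score(376, 253), 1))
```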

31
ESTScan output
The output header line contains the following information:
>xxyyyyyzzzzz Normalized_score Raw_score Cutoff Start_coding Stop_coding

>emblAI740923accAI740923 39.6 376 253 1 297 Homo sapiens wg18d05.x1 Soares_NSF_F8_9W_OT_PA_P_S1 Homo sapiens cDNA clone IMAGE:2365449 3' similar to TR:Q15597 Q15597 TRANSLATION INITIATION FACTOR EIF-4GAMMA
TAVIKQRVPILLKYLDSDTEKELQALYALQASIVKLDQPANLLRMFFDCLYDEEVISEDA
FYKWESSKDPAEQNGKGVALKSVTAFFTWLREAEEESED

ACTGCTGTTATCAAGCAGAGAGTGCCGATCTTACTCAAGTACCTAGACTCAGATACAGAG
AAGGAACTGCAAGCACTTTATGCACTACAAGCATCGATAGTAAAACTTGATCAACCTGCC
AATTTGCTGCGGATGTTTTTTGATTGTCTATATGACGAGGAGGTGATCTCCGAGGATGCC
TTCTACAAATGGGAGAGCAGCAAGGACCCTGCAGAGCAGAATGGGAAGGGCGTGGCTCTG
AAATCTGTCACGGCATTCTTCACGTGGCTGCGGGAAGCAGAAGAGGAGTCTGAGGATaac
taaaacttcaaatacccaaaatgaaacaaaagaaacaatttaagtatttttttaaaaaag
tttcacgtcttcgccaatcacagtgcagcaaggccaattctcgcagaaacccccacgtgt
gcacgagtgggagaggggaaagagaaaaaaaggtgatcatggaggaaaaaggtactggat
aaaagtaaacttcaaaccttagggcgggagcactaaaaccaaaaaaaaaaaa
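
A short sketch of how such output might be consumed: split the header into its numeric fields and translate the start of the predicted coding region with the standard genetic code. The identifier is used as it appears in the example, with a leading ">" assumed; parse_header and translate are names invented for this illustration, not part of ESTScan.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def parse_header(header):
    """Split an ESTScan-style header into identifier and five numeric fields."""
    ident, norm, raw, cutoff, start, stop = header.lstrip(">").split()[:6]
    return {"id": ident, "normalized": float(norm), "raw": float(raw),
            "cutoff": float(cutoff), "start": int(start), "stop": int(stop)}


def translate(dna):
    """Translate in frame 1; unknown or incomplete codons become X or are dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


header = ">emblAI740923accAI740923 39.6 376 253 1 297"
info = parse_header(header)
first_60_bases = "ACTGCTGTTATCAAGCAGAGAGTGCCGATCTTACTCAAGTACCTAGACTCAGATACAGAG"

print(info["stop"] - info["start"] + 1)   # length of the predicted coding region
print(translate(first_60_bases))          # compare with the start of the protein line
```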
32
Test data for CDS recognition

Coding regions: isochore-separated sets from
GENIO
3' UTR: annotated regions from human genes,
extracted by SRS
ESTs: identified by BLAST (5 best scores), and
trimmed by xblast

33
CDS recognition tests

For each sequence in test set, calculate
normalized score on both strands and report best
of two
Plot data as regular or cumulative frequency
histograms
Estimate discriminant power from overlap between
test sets
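
Scoring both strands and keeping the better result can be sketched as below. The scoring function is passed in as a parameter; toy_score is a deliberately trivial, hypothetical stand-in (it counts ATG triplets) used only so the example runs.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def best_strand_score(seq, score):
    """Score seq and its reverse complement; return (best score, strand)."""
    forward = score(seq)
    reverse = score(reverse_complement(seq))
    return (forward, "+") if forward >= reverse else (reverse, "-")


def toy_score(seq):
    """Hypothetical stand-in for a normalized coding score."""
    return seq.upper().count("ATG")


print(best_strand_score("CATCATCAT", toy_score))
```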

34
CDS recognition - results
[Figure not reproduced; panel labels: CDS, EST]
35
CDS recognition - improvements

Introduce explicit models for non-coding regions,
especially 3' UTRs
Introduce a translation initiation recognition
module
Re-optimize the HMM's CDS scoring parameters on
EST data

36
Frame recognition and correction

Test set: CDS extracted from all human
non-Unigene ESTs by ESTScan
Compare results of blastp against this set and
tblastn against same EST collection (32-fold
larger)
Result: an equivalent number of sequences were lost by
ESTScan or added to the match list because of frame corrections

37
Frame recognition and correction

Informal testing: took 2 sequences known to
contain frameshift errors
First contains frameshift near 3' end that does
not affect length of CDS - corrected by ESTScan
Second is EST with multiple sequencing errors
including two frameshifts

38
Uses of ESTScan

Flag ESTs that contain CDS for further analysis
(e.g. in ORESTES)
Create virtual protein sequence databases from
ESTs or EST contigs
Detect and correct reading frame during cDNA
sequencing (e.g. in SEREX project)

39
ESTScan - the future

Include non-CDS and translation initiation
recognition
Coding exon recognition in low-quality genomic
sequences
Produce parameter and normalization tables for
more species (e.g. Drosophila)
Integrate with contig assembly

40
Using EST data for gene expression analysis

Basic idea: compare relative abundance of
specific cDNAs in libraries of different origins
Best tools available: Digital Differential
Display (DDD) and xProfiler at the CGAP Web site
Alternative: browse through genes unique to
different libraries
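
One standard way to judge whether a gene's EST count differs between two libraries is Fisher's exact test on a 2x2 table of counts. The sketch below is a plain implementation of that test, offered as an illustration of the basic idea rather than as a description of how the tools named above work; the library sizes and counts in the example are made up.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    a, c: ESTs assigned to the gene in library 1 and library 2
    b, d: all other ESTs in library 1 and library 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x gene ESTs in library 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


# Hypothetical counts: 12 of 5000 ESTs in one library, 2 of 6000 in another.
p_value = fisher_exact_two_sided(12, 5000 - 12, 2, 6000 - 2)
print(p_value)
```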
