EST sequences and databases: Exploring the transcriptome

Provided by: chEmbnetO
Transcript and Presenter's Notes

1
EST sequences and databases
  • Exploring the transcriptome

2
Why EST sequencing?
  • Systematic sampling of the transcribed portion of
    the genome (transcriptome)
  • Provides sequence tags allowing unique
    identification of genes (e.g. for SAGE)
  • Provides experimental evidence for the positions
    of exons
  • Provides regions coding for potentially new
    proteins
  • Provides clones for DNA microarrays

3
ESTs - the basics
  • cDNA libraries prepared from various tissues and
    cell lines, using directional cloning
  • Gridding of individual clones using robots
  • For each clone, sequencing of both ends of insert
    in single pass
  • Deposit readable part of sequence in database

4
cDNA libraries
  • Most are native, meaning that clone frequency
    reflects mRNA abundance
  • Most are primed with oligo(dT), meaning that 3'
    ends are heavily represented
  • The complexity of libraries is extremely variable
  • Normalized libraries are used to enrich for
    rare mRNAs

5
cDNA libraries used
  • Currently 2225 libraries represented
  • Most libraries managed by the IMAGE consortium
  • Human (over 300) and mouse (75) libraries most
    abundantly represented at IMAGE
  • Systematic effort to make libraries from
    cancerous tissue: CGAP project (NCI)
  • Many tissues still not sampled
  • Quality very uneven

6
Clone availability
  • In principle, all clones produced by IMAGE are
    publicly available
  • Distributors: ATCC, Incyte and Research Genetics
    in the US; HGMP (UK) and RZPD (Germany) in Europe
  • Error rate is high: about 30% chance that a clone
    doesn't have the expected sequence
  • Research Genetics sells sets of sequence-verified
    clones

7
EST databases
  • EMBL/GenBank have separate sections for EST
    sequences
  • ESTs are the most abundant entries in the
    databases (>60%)
  • ESTs are not separated by species in the
    databases
  • EST sequences are submitted in bulk, but do have
    to meet minimal quality criteria

8
EST entries in EMBL (1)
ID   AI242177   standard; RNA; EST; 581 BP.
AC   AI242177;
SV   AI242177.1
DT   05-NOV-1998 (Rel. 57, Created)
DT   03-MAR-2000 (Rel. 63, Last updated, Version 3)
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3' similar to gb:M10988 TUMOR NECROSIS FACTOR
DE   PRECURSOR (HUMAN), mRNA sequence.
RN   [1]
RP   1-581
RA   NCI-CGAP;
RT   "National Cancer Institute, Cancer Genome Anatomy Project (CGAP), Tumor
RT   Gene Index http://www.ncbi.nlm.nih.gov/ncicgap";
RL   Unpublished.
DR   RZPD; IMAGp998P154529; IMAGp998P154529.
CC   On May 19, 1998 this sequence version replaced gi:2846208.
CC   Contact: Robert Strausberg, Ph.D.
CC   Tel: (301) 496-1550
CC   Email: Robert_Strausberg_at_nih.gov
CC   This clone is available royalty-free through LLNL; contact the
CC   IMAGE Consortium (info_at_image.llnl.gov) for further information.
CC   Insert Length: 1280 Std Error: 0.00
CC   Seq primer: -40UP from Gibco
CC   High quality sequence stop: 463.
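
EMBL flat-file entries like the one above are line-oriented: each line starts with a two-letter line code (ID, AC, DT, DE, CC, ...) followed by its content. A minimal sketch of grouping an entry by line code is given below, assuming that layout; the function name group_by_line_code and the ENTRY excerpt are illustrative only and not part of any EMBL tool.

```python
from collections import defaultdict

# A few lines of the entry, laid out with a two-letter line code
# followed by the content (assumed layout for this sketch).
ENTRY = """\
DT   05-NOV-1998 (Rel. 57, Created)
DT   03-MAR-2000 (Rel. 63, Last updated, Version 3)
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3' similar to gb:M10988 TUMOR NECROSIS FACTOR
DE   PRECURSOR (HUMAN), mRNA sequence.
RL   Unpublished.
"""


def group_by_line_code(entry_text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code.strip() and code != "XX":  # skip blank and spacer lines
            fields[code].append(content)
    return dict(fields)


fields = group_by_line_code(ENTRY)
description = " ".join(fields["DE"])  # DE lines continue one description
print(description)
print(fields["DT"])
```

Joining the DE lines recovers the full description, which is where the clone identifier and the "similar to" annotation of an EST usually sit.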
9
EST entries in EMBL (2)
FH   Key             Location/Qualifiers
FH
FT   source          1..581
FT                   /db_xref="taxon:9606"
FT                   /db_xref="ESTLIB:452"
FT                   /db_xref="RZPD:IMAGp998P154529"
FT                   /note="Organ: Liver and Spleen; Vector: pT7T3D (Pharmacia)
FT                   with a modified polylinker; Site_1: Pac I; Site_2: Eco RI;
FT                   This is a subtracted version of the original Soares fetal
FT                   liver spleen 1NFLS library. 1st strand cDNA was primed
FT                   with a Pac I - oligo(dT) primer 5'
FT                   AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3',
FT                   double-stranded cDNA was ligated to Eco RI adaptors
FT                   (Pharmacia), digested with Pac I and cloned into the Pac I
FT                   and Eco RI sites of the modified pT7T3 vector. Library
FT                   went through one round of normalization. Library
FT                   constructed by Bento Soares and M.Fatima Bonaldo."
FT                   /sex="male"
FT                   /organism="Homo sapiens"
FT                   /clone="IMAGE:1851134"
FT                   /clone_lib="Soares_fetal_liver_spleen_1NFLS_S1"
FT                   /dev_stage="20 week-post conception fetus"
FT                   /lab_host="DH10B (ampicillin resistant)"
10
EST sequence quality
  • Single, unverified runs, so quality is on average
    low
  • Current submission rules require Phred score >
    20 (<1% error) documented
  • Trivial contaminants are common (vector, rRNA,
    mitRNA, other species)
  • Frameshift errors are very common
  • For many reads, trace files can be retrieved
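
The link between a Phred score and an error rate follows the standard definition Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The short sketch below applies that definition; the function names are illustrative and not taken from the Phred program itself.

```python
import math


def phred_to_error_probability(q):
    """Estimated base-call error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score for an estimated base-call error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))
```

A score of 20 corresponds to a 1% error probability, so requiring a score above 20 is the same as requiring less than 1% expected error per base.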

11
Private EST databases
  • Producing and selling access to EST data has
    proven to be a lucrative business
  • Incyte has created the LifeSeq databases, to
    which access can be bought
  • Human Genome Sciences has preferred to exploit
    the data itself, and to get patents on promising
    genes found in its databases

12
Why search EST databases?
  • Alternative to library screening: a short tag can
    lead to a cDNA clone
  • Alternative to cDNA sequencing: sequences of
    multiple ESTs can reconstitute a full-length cDNA
  • Gene discovery: EST sequences often contain new
    members of known families

13
BLAST searching EST databases
  • Many sites where EST databases can be searched
    with BLAST, including our own
  • As EST sections of EMBL/GenBank are not separated
    by species, one should choose the server carefully
    (e.g. NCBI has human, mouse, others; EBI has no
    species choice)
  • Searching large EST collections with a protein
    query (tblastn) can take very long...

14
Problems with raw EST databases
  • The databases are highly redundant, e.g. 1.7 x 10^6
    human sequences for 10^5 genes
  • The databases are skewed for sequences near the 3'
    end of mRNAs
  • The error rates are high in individual ESTs
  • For most ESTs, there is no indication as to the
    gene from which it was derived

15
The ORESTES project
  • Goal: to obtain EST sequences from the
    under-represented, often coding, central portions
    of mRNAs
  • Methodology: use low-stringency semi-random
    priming followed by PCR, producing low complexity
    libraries
  • Results: over 250,000 ESTs produced, of which
    half produce novel information

16
How to organize EST collections?
  • Clustering: associate individual EST sequences
    with unique transcripts or genes
  • Assembling: derive consensus sequences from
    overlapping ESTs belonging to the same cluster
  • Mapping: associate ESTs (or EST contigs) with
    exons in genomic sequences
  • Interpreting: find and correct coding regions
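
To make the clustering step concrete, here is a toy sketch that groups reads sharing a minimum number of k-mers, using a union-find structure so that links are transitive. It is an assumption-laden illustration only: it is not the method used by UniGene, TIGR or STACK, the parameters k and min_shared are arbitrary, reverse-complement matches are ignored, and the all-against-all comparison would not scale to real EST collections.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=8, min_shared=3):
    """Single-linkage clusters: reads sharing >= min_shared k-mers are joined."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    kmer_sets = [kmers(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Two made-up transcripts, each sampled by two overlapping reads.
gene_a = "ACGTTGCAAGCTTAGGCTAACGGATCCATGCATTAGC"
gene_b = "AAATTTCCCGGGAAACCCTTTGGGAAAGGGTTTCCC"
reads = [gene_a[:25], gene_a[10:35], gene_b[:24], gene_b[9:33]]
print(cluster_by_shared_kmers(reads))
```

Each returned list holds the indices of the reads assigned to one putative transcript; assembly would then derive a consensus per cluster.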

17
The Unigene database
  • Unigene is an ongoing effort at NCBI to cluster
    EST sequences with traditional gene sequences
  • For each cluster, there is a lot of additional
    information included
  • Unigene is regularly rebuilt. Therefore, cluster
    identifiers are not stable gene indices

18
The TIGR Gene Indices
  • The Gene Indices are based on contig assemblies
    rather than true clustering strategies
  • There are more Gene Indices than Unigene clusters
  • TIGR maintains EGAD, a database of known human
    transcripts
  • EGAD entries are included in the Human Gene
    Indices

19
Other EST cluster and assembly databases
  • SANBI in South Africa produces the STACK
    collection of human EST contigs
  • MIPS in Munich and the SIB produce
    BLAST-searchable contigs from Unigene
  • TIGEM in Italy has a nice collection of EST
    search and assembly tools, local & remote
  • The CBIL at the U. of Pennsylvania has assembled
    the DOTS database

20
Strategy for gene discovery
  • Cluster EST sequences to identify candidate
    transcripts
  • Assemble to increase length and reduce redundancy
  • Find coding regions and reading frame, and
    correct frameshift errors
  • Use deduced protein sequences as searchable
    database (TrEST)

21
ESTScan
  • A program to recognize and correct coding regions
    in EST sequences

22
Design goals
  • Discrimination between coding and non-coding
    sequences
  • Detection of beginning and end of coding regions
  • Correction of frameshift errors
  • Tolerance to artefactual stop codons

23
Methodology
  • Detection of coding regions: use known bias in
    hexanucleotide frequencies
  • Implemented as a fifth-order, inhomogeneous
    Markov model (taken from C. Burge and S. Karlin)
  • Modify the model to accommodate insertions and
    deletions
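
A fifth-order model conditions each base on the five bases before it, which is how hexanucleotide bias enters the score; the inhomogeneous (three-periodic) variant keeps a separate table per codon position. The sketch below is a simplified illustration under stated assumptions: it is not ESTScan's implementation, it uses toy training data rather than the Burge and Karlin parameters, it assumes training CDS start at codon position 1, and it does not model insertions or deletions.

```python
import math
from collections import defaultdict

BASES = "ACGT"
ORDER = 5  # each base is conditioned on the five preceding bases


def train(seqs, periodic, pseudocount=1.0):
    """Log-probability tables for P(base | previous 5 bases [, codon phase])."""
    n_phases = 3 if periodic else 1
    counts = [defaultdict(lambda: dict.fromkeys(BASES, pseudocount))
              for _ in range(n_phases)]
    for seq in seqs:
        for i in range(ORDER, len(seq)):
            context, base = seq[i - ORDER:i], seq[i]
            if set(context + base) <= set(BASES):  # skip N and other symbols
                counts[i % n_phases][context][base] += 1
    tables = []
    for phase_counts in counts:
        table = {}
        for context, base_counts in phase_counts.items():
            total = sum(base_counts.values())
            table[context] = {b: math.log(c / total)
                              for b, c in base_counts.items()}
        tables.append(table)
    return tables


def log_likelihood(seq, tables, frame=0):
    """Log-likelihood of seq; unseen contexts fall back to uniform 0.25."""
    uniform = math.log(0.25)
    total = 0.0
    for i in range(ORDER, len(seq)):
        table = tables[(i - frame) % len(tables)]
        total += table.get(seq[i - ORDER:i], {}).get(seq[i], uniform)
    return total


def coding_log_odds(seq, coding_tables, noncoding_tables):
    """Best-frame coding log-likelihood minus the non-coding log-likelihood."""
    best = max(log_likelihood(seq, coding_tables, f) for f in range(3))
    return best - log_likelihood(seq, noncoding_tables)


# Toy training sets (made up): a periodic "coding" repeat and an AT-rich repeat.
coding_tables = train(["GCTGAA" * 15], periodic=True)
noncoding_tables = train(["ATATTTATAAATTTTATA" * 5], periodic=False)
print(coding_log_odds("GCTGAA" * 6, coding_tables, noncoding_tables))
print(coding_log_odds("ATATTTATAAATTTTATA" * 2, coding_tables, noncoding_tables))
```

Trying all three frames and keeping the best one is what lets the same tables report both the coding potential and the most likely reading frame.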

24
Markov models
  • Markov model: the probability of occurrence of a
    symbol at one position depends on the values of
    preceding positions
  • Hidden Markov model: the observed sequence is
    produced by a Markov model whose hidden states
    emit symbols with different probabilities
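
As an illustration of the hidden Markov model idea, the sketch below recovers the most probable hidden-state path with the Viterbi algorithm. The two states, their emission probabilities and the transition probabilities are made-up toy values, not the ESTScan HMM, whose states and parameters are not given in these notes.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for the observed symbols (log space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back_pointers = []
    for symbol in observations[1:]:
        previous = scores[-1]
        current, pointers = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: previous[p] + math.log(trans_p[p][s]))
            current[s] = (previous[best_prev] + math.log(trans_p[best_prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores.append(current)
        back_pointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model (assumed values): a GC-rich and an AT-rich hidden state.
states = ("GC", "AT")
start_p = {"GC": 0.5, "AT": 0.5}
trans_p = {"GC": {"GC": 0.9, "AT": 0.1}, "AT": {"GC": 0.1, "AT": 0.9}}
emit_p = {"GC": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "AT": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
path = viterbi("GGCGCGCCATATTAATAT", states, start_p, trans_p, emit_p)
print("".join("g" if s == "GC" else "a" for s in path))
```

The observed bases are visible, the state labels are not; decoding assigns each base to the state that best explains it given its neighbours, which is the same principle used to label stretches of an EST as coding or non-coding.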

25
Types of Markov models
a) Three-periodic fifth-order Markov model
b) Homogeneous fifth-order Markov model
c) Hidden Markov model
Burge & Karlin, Curr. Opinion Struct. Biol. 8:346, 1998
26
The ESTScan HMM
27
Normalization
  • Goal: establish score boundaries between coding
    and non-coding regions
  • Problem: raw scores are dependent on both length
    and GC content
  • Approach: establish an empirical system based on
    observed behavior of the algorithm

28
Normalization strategy
  • Create pseudo-EST databases devoid of coding
    potential, with entries of variable length
  • Split these on the basis of GC content
  • Look at score distributions, in terms of the
    proportion of false positives above a given
    cutoff
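
The strategy above can be sketched as follows: shuffle each sequence to destroy coding potential while keeping its length and GC content, bin the decoys by GC class, and read the cutoff off the empirical score distribution. This is a hedged sketch only: score_fn stands in for the real scoring algorithm, the GC class boundaries are the isochore limits quoted on the next slide (with exact boundary values assigned to the higher class as an assumption), and binning by length is omitted.

```python
import random

# Upper GC% bound of isochore classes I-III; anything higher is class IV.
ISOCHORE_BOUNDS = [(43.0, "I"), (51.0, "II"), (57.0, "III")]


def gc_percent(seq):
    """GC content of a non-empty sequence, in percent."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(b) for b in "GC") / len(seq)


def isochore_class(seq):
    gc = gc_percent(seq)
    for upper, name in ISOCHORE_BOUNDS:
        if gc < upper:
            return name
    return "IV"


def shuffled(seq, rng):
    """Random permutation of the bases: same length and composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def empirical_cutoff(scores, false_positive_rate):
    """Observed score with at most the given fraction of scores above it."""
    ordered = sorted(scores)
    allowed_above = int(false_positive_rate * len(ordered))
    return ordered[max(0, len(ordered) - 1 - allowed_above)]


def build_cutoff_table(seqs, score_fn, false_positive_rate, seed=0):
    """Cutoff score per GC class, computed on shuffled (decoy) sequences."""
    rng = random.Random(seed)
    scores_by_class = {}
    for seq in seqs:
        decoy = shuffled(seq, rng)
        scores_by_class.setdefault(isochore_class(decoy), []).append(score_fn(decoy))
    return {name: empirical_cutoff(scores, false_positive_rate)
            for name, scores in scores_by_class.items()}


# Usage with a placeholder scoring function (here simply the length).
print(build_cutoff_table(["ATATATATAT", "GCGCGCGCGC", "ATGC" * 5], len, 0.05))
```

Because decoys contain no real coding regions, any decoy scoring above the cutoff is by construction a false positive, which is what ties the cutoff to a chosen false-positive rate.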

29
Behavior of scoring system
Cutoff values used for score normalization.
Shuffled EST sequence databases with entries of
uniform length were produced as described. The
isochore classes were I: <43% GC; II: 43-51%;
III: 51-57%; and IV: >57%. The scores for all
entries in each database were calculated, and the
cutoff scores for false positive rates f were
calculated. The actual table used by ESTScan
contains a larger number of entries, based on more
values for f and length.
30
Effects of normalization
  • Normalized score = log(raw/cutoff) x 100
  • Positive scores denote coding regions, negative
    scores non-coding regions
  • The acceptable level of false positives can be
    chosen by the user
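
A quick check of the formula against the example header on the ESTScan output slide (raw score 376, cutoff 253, normalized score 39.6). The slide does not state the base of the logarithm; the natural logarithm is assumed here because it reproduces the reported value.

```python
import math


def normalized_score(raw_score, cutoff):
    """100 x ln(raw/cutoff); positive when the raw score exceeds the cutoff."""
    return 100.0 * math.log(raw_score / cutoff)


print(round(normalized_score(376, 253), 1))  # 39.6
```

A raw score equal to the cutoff gives 0, so the sign of the normalized score directly encodes which side of the chosen false-positive cutoff a sequence falls on.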

31
ESTScan output
The output header line contains the following information:
>xxyyyyyzzzzz Normalized_score Raw_score Cutoff Start_coding Stop_coding

>emblAI740923accAI740923 39.6 376 253 1 297 Homo sapienswg18d05.x1 Soares_NSF_F8_9W_OT_PA_P_S1 Homo sapiens cDNA clone IMAGE:2365449 3' similar to TR:Q15597 Q15597 TRANSLATION INITIATION FACTOR EIF-4GAMMA
TAVIKQRVPILLKYLDSDTEKELQALYALQASIVKLDQPANLLRMFFDCLYDEEVISEDA
FYKWESSKDPAEQNGKGVALKSVTAFFTWLREAEEESED

ACTGCTGTTATCAAGCAGAGAGTGCCGATCTTACTCAAGTACCTAGACTCAGATACAGAG
AAGGAACTGCAAGCACTTTATGCACTACAAGCATCGATAGTAAAACTTGATCAACCTGCC
AATTTGCTGCGGATGTTTTTTGATTGTCTATATGACGAGGAGGTGATCTCCGAGGATGCC
TTCTACAAATGGGAGAGCAGCAAGGACCCTGCAGAGCAGAATGGGAAGGGCGTGGCTCTG
AAATCTGTCACGGCATTCTTCACGTGGCTGCGGGAAGCAGAAGAGGAGTCTGAGGATaac
taaaacttcaaatacccaaaatgaaacaaaagaaacaatttaagtatttttttaaaaaag
tttcacgtcttcgccaatcacagtgcagcaaggccaattctcgcagaaacccccacgtgt
gcacgagtgggagaggggaaagagaaaaaaaggtgatcatggaggaaaaaggtactggat
aaaagtaaacttcaaaccttagggcgggagcactaaaaccaaaaaaaaaaaa
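
The example output can be checked for internal consistency: positions Start_coding to Stop_coding (1 to 297) of the nucleotide sequence, shown in upper case, should translate to the reported protein. The sketch below does that with the standard genetic code; the variable and function names are illustrative, and the sequence is transcribed from the slide.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons of dna, starting at its first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


EST = (
    "ACTGCTGTTATCAAGCAGAGAGTGCCGATCTTACTCAAGTACCTAGACTCAGATACAGAG"
    "AAGGAACTGCAAGCACTTTATGCACTACAAGCATCGATAGTAAAACTTGATCAACCTGCC"
    "AATTTGCTGCGGATGTTTTTTGATTGTCTATATGACGAGGAGGTGATCTCCGAGGATGCC"
    "TTCTACAAATGGGAGAGCAGCAAGGACCCTGCAGAGCAGAATGGGAAGGGCGTGGCTCTG"
    "AAATCTGTCACGGCATTCTTCACGTGGCTGCGGGAAGCAGAAGAGGAGTCTGAGGATaac"
    "taaaacttcaaatacccaaaatgaaacaaaagaaacaatttaagtatttttttaaaaaag"
    "tttcacgtcttcgccaatcacagtgcagcaaggccaattctcgcagaaacccccacgtgt"
    "gcacgagtgggagaggggaaagagaaaaaaaggtgatcatggaggaaaaaggtactggat"
    "aaaagtaaacttcaaaccttagggcgggagcactaaaaccaaaaaaaaaaaa"
)
PROTEIN = (
    "TAVIKQRVPILLKYLDSDTEKELQALYALQASIVKLDQPANLLRMFFDCLYDEEVISEDA"
    "FYKWESSKDPAEQNGKGVALKSVTAFFTWLREAEEESED"
)

start_coding, stop_coding = 1, 297  # 1-based, inclusive, from the header line
coding = EST[start_coding - 1:stop_coding]
print(translate(coding) == PROTEIN)
```

The coding span is 297 bases, exactly three times the 99 residues of the reported peptide, and it coincides with the upper-case part of the sequence.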
32
Test data for CDS recognition
  • Coding regions: isochore-separated sets from
    GENIO
  • 3' UTR: annotated regions from human genes,
    extracted by SRS
  • ESTs: identified by BLAST (5 best scores), and
    trimmed by xblast

33
CDS recognition tests
  • For each sequence in test set, calculate
    normalized score on both strands and report best
    of two
  • Plot data as regular or cumulative frequency
    histograms
  • Estimate discriminant power from overlap between
    test sets
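
Scoring both strands means scoring the sequence and its reverse complement and keeping the better result. A minimal sketch follows; score_fn is a placeholder for whatever scoring function is used (the normalized ESTScan score in the tests described above), and the toy scorer in the usage line is purely illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (other symbols are kept as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_strand_score(seq, score_fn):
    """Return (strand, score) for the better-scoring of the two strands."""
    forward = score_fn(seq)
    reverse = score_fn(reverse_complement(seq))
    return ("+", forward) if forward >= reverse else ("-", reverse)


# Toy scorer for demonstration only: number of ATG triplets.
print(best_strand_score("CATCAT", lambda s: s.count("ATG")))
```

This matters for ESTs because reads from the 3' end of a clone are typically reported on the strand opposite to the mRNA.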

34
CDS recognition - results
[Figure panels: CDS, EST]
35
CDS recognition - improvements
  • Introduce explicit models for non-coding regions,
    especially 3' UTRs
  • Introduce a translation initiation recognition
    module
  • Re-optimize the HMM's CDS scoring parameters on
    EST data

36
Frame recognition and correction
  • Test set: CDS extracted from all human
    non-Unigene ESTs by ESTScan
  • Compare results of blastp against this set and
    tblastn against same EST collection (32-fold
    larger)
  • Result: equivalent number of seq. lost by ESTScan
    or added to match list because of frame corrections

37
Frame recognition and correction
  • Informal testing: took 2 sequences known to
    contain frameshift errors
  • First contains frameshift near 3' end that does
    not affect length of CDS - corrected by ESTScan
  • Second is EST with multiple sequencing errors
    including two frameshifts

38
Uses of ESTScan
  • Flag ESTs that contain CDS for further analysis
    (e.g. in ORESTES)
  • Create virtual protein sequence databases from
    ESTs or EST contigs
  • Detect and correct reading frame during cDNA
    sequencing (e.g. in SEREX project)

39
ESTScan - the future
  • Include non-CDS and translation initiation
    recognition
  • Coding exon recognition in low-quality genomic
    sequences
  • Produce parameter and normalization tables for
    more species (e.g. Drosophila)
  • Integrate with contig assembly

40
Using EST data for gene expression analysis
  • Basic idea: compare relative abundance of
    specific cDNAs in libraries of different origins
  • Best tools available: Digital Differential
    Display (DDD) and xProfiler at the CGAP Web site
  • Alternative: browse through genes unique to
    different libraries
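
The basic idea can be illustrated with EST counts for one gene in two libraries: compute the relative abundance in each and ask whether the difference is larger than sampling noise would explain. The sketch below uses a two-sided Fisher exact test as one common choice for such count tables; it is an assumption for illustration and is not presented as the procedure implemented by DDD or xProfiler. The counts in the usage lines are made up.

```python
from math import comb


def relative_abundance(gene_count, library_size):
    """Fraction of a library's ESTs that belong to the gene."""
    return gene_count / library_size


def fisher_exact_two_sided(count_a, size_a, count_b, size_b):
    """Two-sided Fisher exact p-value for a gene's EST counts in two libraries."""
    gene_total = count_a + count_b
    denominator = comb(size_a + size_b, gene_total)

    def probability(x):
        # Hypergeometric probability that x of the gene's ESTs fall in library A.
        return comb(size_a, x) * comb(size_b, gene_total - x) / denominator

    observed = probability(count_a)
    lowest = max(0, gene_total - size_b)
    highest = min(size_a, gene_total)
    p_value = sum(
        p for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)


# Hypothetical counts: 30 of 5000 ESTs in library A, 5 of 5000 in library B.
print(relative_abundance(30, 5000), relative_abundance(5, 5000))
print(fisher_exact_two_sided(30, 5000, 5, 5000))
```

Counts from non-normalized libraries are the appropriate input here, since only in those does clone frequency reflect mRNA abundance.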