Exploring the transcriptome - PowerPoint PPT Presentation

About This Presentation
Title:

Exploring the transcriptome

Description:

Systematic sampling of the transcribed portion of the genome ('transcriptome' ... CC This clone is available royalty-free through LLNL ; contact the ... – PowerPoint PPT presentation

Slides: 48
Provided by: vict109

Transcript and Presenter's Notes

Title: Exploring the transcriptome


1
Exploring the transcriptome
  • From transcripts to genes

2
Transcriptome sequencing
  • "Traditional" sequencing
      • cDNA clones isolated on the basis of some
        functional property of interest to a group
  • EST sequencing
      • Large-scale sampling of end sequences of all cDNA
        clones present in a library
  • "Full-length" sequencing
      • Systematic attempts to obtain high-quality
        sequences of cDNA clones representing all
        transcribed genes

3
Why EST sequencing?
  • Systematic sampling of the transcribed portion of
    the genome (transcriptome)
  • Provides sequence tags allowing unique
    identification of genes (e.g. for SAGE)
  • Provides experimental evidence for the positions
    of exons
  • Provides regions coding for potentially new
    proteins
  • Provides clones for DNA microarrays

4
ESTs - the basics
  • cDNA libraries prepared from various tissues and
    cell lines, using directional cloning
  • Gridding of individual clones using robots
  • For each clone, sequencing of both ends of insert
    in single pass
  • Deposit readable part of sequence in database

5
cDNA libraries
  • Most are native, meaning that clone frequency
    reflects mRNA abundance
  • Most are primed with oligo(dT), meaning that 3'
    ends are heavily represented
  • The complexity of libraries is extremely variable
  • Normalized libraries are used to enrich for
    rare mRNAs

6
cDNA libraries used
  • Large number of libraries represented
  • Most libraries managed by the IMAGE consortium
  • Human and mouse libraries most abundantly
    represented at IMAGE
  • Systematic effort to make libraries from
    cancerous tissue: CGAP project (NCI)
  • Many tissues still not sampled
  • Quality very uneven

7
Clone availability
  • In principle, all clones produced by IMAGE are
    publicly available
  • Distributors: ATCC, Incyte and Invitrogen in the US;
    HGMP (UK) and RZPD (Germany) in Europe
  • Error rate is high: about a 30% chance that a clone
    doesn't have the expected sequence
  • Invitrogen sells sets of sequence-verified clones

8
EST databases
  • EMBL/GenBank have separate sections for EST
    sequences
  • ESTs are the most abundant entries in the
    databases (>60%)
  • ESTs are now separated by division in the
    databases
      • EMBL: human, mouse, plant, prokaryote, etc.
  • EST sequences are submitted in bulk, but do have
    to meet minimal quality criteria

9
EST entries in EMBL (1)
ID   AI242177   standard; RNA; EST; 581 BP.
AC   AI242177;
SV   AI242177.1
DT   05-NOV-1998 (Rel. 57, Created)
DT   03-MAR-2000 (Rel. 63, Last updated, Version 3)
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3' similar to gb:M10988 TUMOR NECROSIS FACTOR
DE   PRECURSOR (HUMAN), mRNA sequence.
RN   [1]
RP   1-581
RA   NCI-CGAP;
RT   National Cancer Institute, Cancer Genome Anatomy Project (CGAP), Tumor
RT   Gene Index http://www.ncbi.nlm.nih.gov/ncicgap
RL   Unpublished.
DR   RZPD; IMAGp998P154529; IMAGp998P154529.
CC   On May 19, 1998 this sequence version replaced gi:2846208.
CC   Contact: Robert Strausberg, Ph.D.
CC   Tel: (301) 496-1550
CC   Email: Robert_Strausberg_at_nih.gov
CC   This clone is available royalty-free through LLNL; contact the
CC   IMAGE Consortium (info_at_image.llnl.gov) for further information.
CC   Insert Length: 1280 Std Error: 0.00
CC   Seq primer: -40UP from Gibco
CC   High quality sequence stop: 463.
10
EST entries in EMBL (2)
FH   Key             Location/Qualifiers
FH
FT   source          1..581
FT                   /db_xref="taxon:9606"
FT                   /db_xref="ESTLIB:452"
FT                   /db_xref="RZPD:IMAGp998P154529"
FT                   /note="Organ: Liver and Spleen; Vector: pT7T3D (Pharmacia)
FT                   with a modified polylinker; Site_1: Pac I; Site_2: Eco RI;
FT                   This is a subtracted version of the original Soares fetal
FT                   liver spleen 1NFLS library. 1st strand cDNA was primed
FT                   with a Pac I - oligo(dT) primer 5'
FT                   AACTGGAAGAATTAATTAAAGATCTTTTTTTTTTTTTTTTTTT 3',
FT                   double-stranded cDNA was ligated to Eco RI adaptors
FT                   (Pharmacia), digested with Pac I and cloned into the Pac I
FT                   and Eco RI sites of the modified pT7T3 vector. Library
FT                   went through one round of normalization. Library
FT                   constructed by Bento Soares and M.Fatima Bonaldo."
FT                   /sex="male"
FT                   /organism="Homo sapiens"
FT                   /clone="IMAGE:1851134"
FT                   /clone_lib="Soares_fetal_liver_spleen_1NFLS_S1"
FT                   /dev_stage="20 week-post conception fetus"
FT                   /lab_host="DH10B (ampicillin resistant)"
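
Entries in this flat-file layout can be split on the two-letter line code that starts each line. The Python sketch below is illustrative only: `parse_embl_lines` is a hypothetical helper, the record is an abridged copy of the entry shown on these slides, and its punctuation is assumed to follow the usual EMBL conventions.

```python
from collections import defaultdict


def parse_embl_lines(text):
    """Group the lines of one flat-file entry by their two-letter code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # Skip blank lines and the '//' entry terminator.
        if len(line) < 2 or line.startswith("//"):
            continue
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return dict(fields)


# Abridged record; punctuation is an assumption (see the text above).
record = """\
ID   AI242177   standard; RNA; EST; 581 BP.
AC   AI242177;
DE   qh81g08.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA
DE   clone IMAGE:1851134 3'
CC   High quality sequence stop: 463.
//
"""

fields = parse_embl_lines(record)
accession = fields["AC"][0].rstrip(";")
description = " ".join(fields["DE"])
```

Multi-line fields such as DE come back as a list of lines, so joining them is left to the caller.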
11
EST sequence quality
  • Single, unverified runs, so quality is on average
    low (and worsening towards the 3' end)
  • Current submission rules require a documented
    Phred score > 20 (<1% error)
  • Trivial contaminants are common (vector, rRNA,
    mitRNA, other species)
  • Frameshift errors are very common
  • For many reads, trace files can be retrieved
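
The link between a Phred score and an error rate follows from the standard definition Q = -10 log10(P). The short Python sketch below shows why a score of 20 corresponds to a 1% error probability; the function names are illustrative and not taken from any particular tool.

```python
import math


def phred_to_error_probability(q):
    """Standard Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Inverse of the definition above."""
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given per-base scores."""
    return sum(phred_to_error_probability(q) for q in qualities)
```

For example, a 100-base read in which every base has a score of 20 is expected to carry about one error.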

12
Private EST databases
  • Producing and selling access to EST data has
    proven to be a lucrative business
  • Incyte has created the LifeSeq databases, to
    which access can be bought
  • Human Genome Sciences has preferred to exploit
    the data itself, and to get patents on promising
    genes found in its databases

13
Why search EST databases?
  • Alternative to library screening: a short tag can
    lead to a cDNA clone
  • Alternative to cDNA sequencing: sequences of
    multiple ESTs can reconstitute a full-length cDNA
  • Gene discovery: EST sequences often contain new
    members of known families

14
BLAST searching EST databases
  • Many sites where EST databases can be searched
    with BLAST, including our own
  • Searching raw EST data is becoming less
    attractive, as for many species ESTs have been
    incorporated into transcript assemblies
  • There are at present no "consensus" transcript
    assemblies

15
Problems with raw EST databases
  • The databases are highly redundant, e.g. 5 x 10^6
    human sequences for 4 x 10^4 genes
  • The databases are skewed for sequences near the 3'
    end of mRNAs
  • The error rates are high in individual ESTs
  • For most ESTs, there is no indication as to the
    gene from which it was derived

16
The ORESTES project
  • Goal: to obtain EST sequences from the
    under-represented, often coding, central portions
    of mRNAs
  • Methodology: use low-stringency semi-random
    priming followed by PCR, producing low complexity
    libraries
  • Results: over 1000000 ESTs produced, of which
    half produce novel information

17
How to organize EST collections?
  • Clustering: associate individual EST sequences
    with unique transcripts or genes
  • Assembling: derive consensus sequences from
    overlapping ESTs belonging to the same cluster
  • Mapping: associate ESTs (or EST contigs) with
    exons in genomic sequences
  • Interpreting: find and correct coding regions
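
The clustering step can be illustrated with a toy single-linkage scheme in which two ESTs join the same cluster as soon as they share one exact k-mer. This is a sketch under strong simplifying assumptions (no sequencing errors, no reverse complements, no repeats); it is not the procedure used by any of the resources named in these slides, and the three sequences are invented.

```python
def kmers(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=8):
    """Single-linkage clustering of {name: sequence} via union-find."""
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> first EST that contained it
    for name, seq in seqs.items():
        for km in kmers(seq, k):
            if km in first_seen:
                parent[find(name)] = find(first_seen[km])
            else:
                first_seen[km] = name

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


# Invented reads: est_a and est_b overlap by 15 nt, est_c is unrelated.
ests = {
    "est_a": "ATGGCGTACGTTAGCCGATAGGCTAACGT",
    "est_b": "CCGATAGGCTAACGTTTGACCATGCAAGT",
    "est_c": "GGGTTTCCCAAAGGGTTTCCCAAATTTGG",
}
clusters = cluster_by_shared_kmers(ests)
```

Real EST data would need tolerance for mismatches and a way to stop repeats from chaining unrelated genes together, which is one reason clustering is harder than this sketch suggests.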

18
The Unigene database
  • Unigene is an ongoing effort at NCBI to cluster
    EST sequences with traditional gene sequences
  • For each cluster, there is a lot of additional
    information included
  • Unigene is regularly rebuilt. Therefore, cluster
    identifiers are not stable gene indices

19
The TIGR Gene Indices
  • The Gene Indices are based on contig assemblies
    rather than true clustering strategies
  • There are more Gene Indices than Unigene clusters
  • TIGR maintains EGAD, a database of known human
    transcripts
  • EGAD entries are included in the Human Gene
    Indices

20
Other EST cluster and assembly databases
  • SANBI in South Africa produces the STACK
    collection of human EST contigs
  • MIPS in Munich and the SIB produce
    BLAST-searchable contigs from Unigene
  • TIGEM in Italy has a nice collection of EST
    search and assembly tools, local and remote
  • The CBIL at the U. of Pennsylvania has assembled
    the DOTS database, part of the AllGenes project

21
Strategy for gene discovery
  • Cluster EST sequences to identify candidate
    transcripts
  • Assemble to increase length and reduce redundancy
  • Find coding regions and reading frame, and
    correct frameshift errors
  • Use deduced protein sequences as searchable
    database (TrEST)
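
Finding the reading frame can be approximated by looking for the longest stretch without a stop codon in each forward frame. The Python sketch below shows only that step; it does not attempt frameshift correction, it ignores the reverse strand, and the sequence is an invented example rather than a real EST.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_open_stretch(seq, frame):
    """Return (start, end) of the longest stop-free codon run in one frame."""
    seq = seq.upper()
    best = (frame, frame)
    start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i + 3
    # Close the final run at the last complete codon.
    end = frame + 3 * ((len(seq) - frame) // 3)
    if end - start > best[1] - best[0]:
        best = (start, end)
    return best


def best_frame(seq):
    """Forward frame (0, 1 or 2) with the longest stop-free stretch."""
    def open_length(frame):
        start, end = longest_open_stretch(seq, frame)
        return end - start
    return max(range(3), key=open_length)


est = "ATGGCTAACCTGATAGCTAACGTAGCCAAA"  # invented 30-nt example
```

A frameshift error would show up as an open stretch that switches from one frame to another partway through the read, which is why single-frame translation is not enough for raw ESTs.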

22
Interpreting ESTs in the context of the genome
  • The genome sequence provides a reliable scaffold
    for clustering ESTs
  • ESTs clustered on the genome can be assembled by
    following exon/intron boundaries
  • The combination of genome, full-length insert and
    EST data will provide the reference dataset for
    gene and transcript annotation

23
Using EST data for gene expression analysis
  • Basic idea: compare relative abundance of
    specific cDNAs in libraries of different origins
  • Best tools available: Digital Gene Expression
    Displayer (DGED) and xProfiler at the CGAP Web
    site
  • Alternative: browse through genes unique to
    different libraries
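
The basic idea of comparing relative abundance can be written down in a few lines. The Python sketch below normalises EST counts by library size and forms a ratio between two libraries; the pseudocount and all numbers are illustrative assumptions, and the statistics used by DGED or xProfiler are not reproduced here.

```python
def per_million(count, library_size):
    """ESTs for one gene, scaled to a library of one million reads."""
    return count * 1_000_000 / library_size


def abundance_ratio(count_a, size_a, count_b, size_b, pseudocount=1):
    """Ratio of relative abundances; the pseudocount avoids division by zero."""
    freq_a = (count_a + pseudocount) / size_a
    freq_b = (count_b + pseudocount) / size_b
    return freq_a / freq_b
```

With small libraries the counts are low, so a ratio alone says little without a significance test; this sketch leaves that part out.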

24
Full-length cDNA sequencing
  • Principle: make high-quality cDNA libraries,
    often optimized for full-length inserts
  • Do first-pass sequencing of 3' and 5' ends, and
    select good clones: one per transcript,
    containing initiation codon and poly(A)
  • For selected clones, fully sequence the inserts

25
Full-length sequencing projects
  • MGC (at NCI): goal of 20000 sequences a year
    starting in 2001, human and mouse; actually
    produced 15000 human and 11000 mouse
  • RIKEN Genome Science Lab, Tsukuba: produced and
    annotated 60000 full-length mouse sequences
    (FANTOM2)
  • Kazusa Institute HUGE project: set of over 2000
    KIAA human cDNA sequences (large proteins)
  • NEDO human cDNA project: over 20000 FLJ
    sequences available in early 2003
  • German cDNA consortium (DKFZ): produced over 3200
    sequences; current state of project unclear

26
The future of ESTs
  • In human and mouse, most will come as byproducts
    of full-length projects
  • There are good arguments for trying to reach
    saturation on selected tissues
  • ESTs are still the tool of choice for rapid
    exploration of the transcriptomes of various
    species, especially with large genomes
  • ESTs could form a very solid basis for
    evolutionary studies

27
The need for a gene index
  • All high-throughput biology methods require a
    unique and reliable way to describe the genes
    they are analysing
  • This index should be stable, unique, extensible,
    and independent of a system of nomenclature
  • The index should document all transcript
    sequences belonging to the corresponding gene

28
Some commonly used indices
  • EMBL/GenBank/DDBJ accession numbers
      • Unique and universally accepted BUT
      • Highly redundant (many entries per gene)
  • Unigene cluster identifiers (NCBI)
      • Widely used and non-redundant BUT
      • Rely on clustering procedure (unreliable) AND
      • Unstable: clusters change with each build
  • RefSeq accession numbers (NCBI)
      • Stable and non-redundant BUT
      • Still very far from comprehensive AND
      • Many RefSeq sequences are incomplete AND
      • Splice variants are not systematically documented

29
Emerging gene indices
  • HUGO Gene Nomenclature Committee
      • Assigns unique symbols to human genes with a
        known or putative function
      • About 15000 symbols assigned so far
      • No attempt to link to sequence data or variants
  • Ensembl
      • Uses a "conservative" pipeline to predict genes
        and transcripts
      • 25000 predicted genes and 37000 transcripts so
        far
      • Provides unique identifiers for genes and
        transcripts
      • Linked to HGNC names when possible

30
Gene documentation resources
  • Repositories collecting info about loci from
    multiple sources
      • LocusLink (NCBI): links mostly to NCBI internal
        information sources
      • GeneCards (Weizmann Inst.): uses Web robots to
        collect information, probably most complete
        resource available today
      • GeneLynx (Karolinska Inst.): similar to
        GeneCards, but also collects EST information

31
"tromer" goals
  • To combine the draft genome and transcriptome
    into a database documenting all human transcripts
  • To design a rational representation and indexing
    system for human genes
  • To reconstitute transcripts and their variants
    using genomic DNA as template

32
Methods
  • Create a comprehensive and non-redundant set of
    mRNA 3' ends, documented by polyadenylation sites
    mapped to the genome
  • Create a comprehensive mapping of EST sequences
    to exons on the genome
  • Use this to create a database of exons and
    introns, defined by their coordinates on the
    genome
  • Combine these two data sets to reconstitute
    transcripts

33
Poly(A) site documentation
  • Re-extract EST sequences from trace files (when
    available)
  • Extract 50 nt adjacent to poly(A)
  • Make this collection non-redundant
  • Map all tags to the draft genome
  • Validate the quality of the tags
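
The tag extraction step can be sketched as follows, assuming the read is already in mRNA sense and ends with its poly(A) tail. `extract_polya_tag` and the minimum tail length of 10 are hypothetical choices, not parameters from the slides; the test sequence reuses the tag and the 18-A tail length quoted in the "bad" tag example.

```python
import re


def extract_polya_tag(read, tag_len=50, min_tail=10):
    """Return the tag_len nt immediately upstream of a terminal poly(A) run.

    Returns None when no tail of at least min_tail As is found or when the
    read is too short to supply a full-length tag.
    """
    read = read.upper()
    match = re.search(r"A{%d,}$" % min_tail, read)
    if match is None or match.start() < tag_len:
        return None
    return read[match.start() - tag_len:match.start()]


tag = "AGCACTGATGTTGAATTTCACTTTAAGTTGTCACATGGTCTGAGTTGTAC"
read = "GG" + tag + "A" * 18
```

Reads sequenced from the 3' end would first have to be reverse-complemented, and a tag that itself ends in A cannot be separated from the tail by this simple rule.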

34
Generation of the 3' tags
35
Example of a "bad" tag
ID   3P016851
SQ   1 0 AGCACTGATGTTGAATTTCACTTTAAGTTGTCACATGGTCTGAGTTGTACaaaaagaaagaaaaattctggccgggtgtggtggctcatgcctgtaatcc
PA   Score-315 matchnegative
QQ   A13 C0 G2 T0 testbad
CM   nz76g10.s1AA764893 -, 18 As in polyA
CO   qt57h10.x1AI355333
GN   AP000848_32 pos 981 CHR 11
GN   AP001320_19 pos 2151 CHR 11
GN   AC011875_21 pos 1166 CHR NA
//
Example of a "good" tag
ID   3P016857
SQ   1 0 AGCACTGTCTTATCACATCGCCAATTAGTTGTAATAAACGTTCAACGTACaaacactggagtgtgtttttatctctttccaaaagtttgtcaaactatgc
PA   Score 117 matchstrong
QQ   A5 C2 G5 T3 testok
CM   nk75b07.s1AA551090 -, 17 As in polyA
CO   ob29c06.s1AA741544 qm22a02.x1AI278652 wj07f01.x1AI824173 wm06d04.x1AI887507
CO   xf19g12.x1AW131962 yw01c11.s1H96551 yy23c08.s1N35379 zl39e04.s1AA149587
CO   zq01c05.s1AA193118 zv54a07.s1AA437212
GN   AC004106_1 - pos 34804 CHR X
SQ   2 4 CTGTCTTATCACATCGCCAATTAGTTGTAATAAACGTTCAACGTACAAACactggagtgtgtttttatctctttccaaaagtttgtcaaactatgcagag
PA   Score 134 matchstrong
QQ   A2 C1 G5 T7 testok
CM   aa28a05.r1AA480914 -, 16 As in polyA
CO   qe01d07.x1AI143977 qu50g08.x1AI285755 yh71a05.s1R32806 yv79c01.r1H82355
CO   zn91a12.s1AA127331 zn92a12.s1AA127424
GN   AC004106_1 - pos 34800 CHR X
//
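
The QQ values in these records match the base composition of the first 15 lowercase nucleotides of each SQ sequence, which suggests that the lowercase part is the genomic sequence downstream of the tag and that an A-rich downstream window marks a tag as suspect. The Python sketch below recomputes those counts; the reading of lowercase as downstream sequence and the `max_a` cutoff are assumptions, since the slides do not state the actual test.

```python
from collections import Counter

# First SQ line of each record above (50 nt tag + 50 nt lowercase flank).
BAD_TAG = ("AGCACTGATGTTGAATTTCACTTTAAGTTGTCACATGGTCTGAGTTGTAC"
           "aaaaagaaagaaaaattctggccgggtgtggtggctcatgcctgtaatcc")
GOOD_TAG = ("AGCACTGTCTTATCACATCGCCAATTAGTTGTAATAAACGTTCAACGTAC"
            "aaacactggagtgtgtttttatctctttccaaaagtttgtcaaactatgc")


def downstream_composition(tag_with_flank, window=15):
    """Count A, C, G, T in the first `window` lowercase bases."""
    downstream = "".join(ch for ch in tag_with_flank if ch.islower())[:window]
    counts = Counter(downstream.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def looks_internally_primed(composition, max_a=10):
    """Hypothetical rule: too many As downstream suggests internal priming."""
    return composition["A"] > max_a
```

An A-rich stretch in the genome just downstream of a tag means the oligo(dT) primer may have annealed inside the transcript rather than at a true poly(A) tail.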
36
Exon and intron mapping
  • Document all matches between genome contigs and
    RNAs (pairwise matching) using BLAST (>5 x 10^6
    matches)
  • Create pairwise alignments between RNAs and
    genome contigs using sim4, document exon and
    intron boundaries
  • Import data into ACEDB for exploration and
    visualization

37
>chr|AC018734_9|AC018734.3 Chromosome 21 Clone RP11-351D2 SC WashU.Homo sapiens LEN=30562
>est|AW468733|AW468733.1- LEN=413

16806-16850  (1-45)   97% ->
18234-18383  (46-195)   100% ->
19199-19266  (196-263)   100% ->
19463-19612  (264-413)   100%

      0     .    :    .    :    .    :    .    :    .    :
  16806 CCCGTCGGGAGACACGTTTTGTCAGTTGCTTCCTCTGGAGTCCTTGTG..
        || ||||||||||||||||||||||||||||||||||||||||||>>>..
      1 CCNGTCGGGAGACACGTTTTGTCAGTTGCTTCCTCTGGAGTCCTT

     50     .    :    .    :    .    :    .    :    .    :
  16856 .AAGTTAAAAGACTACTTTTTAGAGCAGTTTTCATTTTACCGATACATCG
        .>>>||||||||||||||||||||||||||||||||||||||||||||||
     46     TTAAAAGACTACTTTTTAGAGCAGTTTTCATTTTACCGATACATCG

    100     .    :    .    :    .    :    .    :    .    :
  18280 ATCAGAGAGTACGGAGCTTCCCACATTGCCTCCCGCACCCCATAGCAGGC
        ||||||||||||||||||||||||||||||||||||||||||||||||||
     92 ATCAGAGAGTACGGAGCTTCCCACATTGCCTCCCGCACCCCATAGCAGGC

    150     .    :    .    :    .    :    .    :    .    :
  18330 AGAGGCACCAGCCAGGCGCCATCCTTTACAAGCTGTTTCAGCAAAATGTC
        ||||||||||||||||||||||||||||||||||||||||||||||||||
    142 AGAGGCACCAGCCAGGCGCCATCCTTTACAAGCTGTTTCAGCAAAATGTC

    200     .    :    .    :    .    :    .    :    .    :
  18380 CCTGGTG...CAGGGTCTTAATGATTAAGTCCAGGCACTCCAAGGACGAT
        ||||>>>...>>>|||||||||||||||||||||||||||||||||||||
    192 CCTG         GGTCTTAATGATTAAGTCCAGGCACTCCAAGGACGAT

    250     .    :    .    :    .    :    .    :    .    :
  19236 TAAATATTTAAACAGCAGCTTCCCTTTGAAGGTC...CAGGGAAGGAAAT
        |||||||||||||||||||||||||||||||>>>...>>>||||||||||
    233 TAAATATTTAAACAGCAGCTTCCCTTTGAAG         GGAAGGAAAT

    300     .    :    .    :    .    :    .    :    .    :
  19473 GACCGCAGAAGGGAGTGCACCGTGGGGGCAGGAGGTGCTGCGGCCAATGG
        ||||||||||||||||||||||||||||||||||||||||||||||||||
    274 GACCGCAGAAGGGAGTGCACCGTGGGGGCAGGAGGTGCTGCGGCCAATGG

    350     .    :    .    :    .    :    .    :    .    :
  19523 GCAGTGCTGGGAGATGAAACCAACAGCGGCGATTGCCCCAGAAAGAACTT
        ||||||||||||||||||||||||||||||||||||||||||||||||||
    324 GCAGTGCTGGGAGATGAAACCAACAGCGGCGATTGCCCCAGAAAGAACTT

    400     .    :    .    :    .    :    .    :
  19573 GGCGCTGCTAAGGCAATGGCGGCCCGAAAGAGGCCCCTGA
        ||||||||||||||||||||||||||||||||||||||||
    374 GGCGCTGCTAAGGCAATGGCGGCCCGAAAGAGGCCCCTGA

Output of sim4 program
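
The exon summary at the top of this output is enough to derive exon and intron coordinates. The Python sketch below parses such lines with a regular expression and computes intron lengths; it assumes the exon lines read `start-end (start-end) NN% ->`, and `parse_exons` and `intron_lengths` are illustrative names, not part of sim4.

```python
import re

EXON_RE = re.compile(r"(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%")

# Exon coordinates taken from the alignment of AW468733 shown above.
summary = """\
16806-16850  (1-45)   97% ->
18234-18383  (46-195)   100% ->
19199-19266  (196-263)   100% ->
19463-19612  (264-413)   100%
"""


def parse_exons(text):
    """Return (genome_start, genome_end, est_start, est_end, identity) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in EXON_RE.finditer(text)]


def intron_lengths(exons):
    """Genomic distance between the end of one exon and the start of the next."""
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]


exons = parse_exons(summary)
aligned_est_length = sum(end - start + 1 for start, end, *_ in exons)
```

Here the four exons add up to 413 nt, the full length of the EST, and the three introns are 1383, 815 and 196 nt long.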
38
[Figure labels: NCAM2, Hs.177691, Hs.135892, DKFZp761I1311, Hs.76118]
39
[Figure labels: COL18A1, Hs.78409, 3' UTR overlap, Hs.84190, SLC19A1]
40
Alternative polyadenylation
  • Affects at least half of all genes
      • Estimate based on manual examination of a 3.5 Mb
        contig from chr 21
  • Exists at both micro (<100 bp) and macro (>100 bp)
    levels
  • Often extends over distances of several kilobases
  • Impacts on microarray probe design as well as
    SAGE data interpretation
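
The micro/macro distinction can be expressed as a simple rule on the spacing between neighbouring poly(A) sites. The Python sketch below applies the 100 bp boundary from the slide; the function name is illustrative, and of the example positions only 34800 and 34804 come from the "good" tag record, the third being invented.

```python
def classify_site_spacing(positions, threshold=100):
    """Label each gap between consecutive poly(A) sites as micro or macro."""
    ordered = sorted(positions)
    return [
        (a, b, "micro" if b - a < threshold else "macro")
        for a, b in zip(ordered, ordered[1:])
    ]


# 34800 and 34804 are from the tag example; 36900 is a made-up distant site.
gaps = classify_site_spacing([34804, 34800, 36900])
```

Sites only a few bases apart would normally share one microarray probe or SAGE tag, whereas sites kilobases apart can give different tags for the same gene.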

41
Reconstituting transcripts (1)
  • Data cleanup
      • Decide which transcript-to-genome matches are
        reliable (gene families, pseudogenes, genome
        repeats, etc.)
      • Within alignments, decide which parts are bona
        fide exons
  • Exon generation
      • Determine directionality, based on splice
        junctions and RNA annotations
      • Determine exon boundaries and completeness

42
Reconstituting transcripts (2)
  • Merge multiple occurrences of the same exon,
    respecting boundary rules
      • E.g. overlapping exons with different splicing
        patterns are not merged
  • Merge multiple occurrences of the same splice
    junctions
      • Distinguish differences due to noise in the data
        from real alternative splice junctions
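
A minimal way to picture the merging step is to collapse observations with identical boundaries and count how often each one was seen. The Python sketch below does only that; the `min_support` cutoff is a hypothetical stand-in for the noise-versus-real decision, and the 18234-18380 observation is invented (the other coordinates come from the sim4 example).

```python
from collections import Counter


def merge_observations(observations):
    """Collapse identical (start, end) pairs and count their occurrences."""
    return Counter(observations)


def supported(counts, min_support=2):
    """Keep features seen at least min_support times (hypothetical cutoff)."""
    return sorted(f for f, n in counts.items() if n >= min_support)


exon_observations = [
    (18234, 18383), (18234, 18383),   # same exon seen in two ESTs
    (18234, 18380),                   # invented variant with another 3' boundary
    (19199, 19266), (19199, 19266),
]
exon_counts = merge_observations(exon_observations)
```

Overlapping exons with different boundaries stay separate entries, in line with the rule that exons with different splicing patterns are not merged.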

43
[Figure: panels A, B and C showing numbered exons (1-7), an Alu element and 3' tags, with positions from 3,592 to 9,995]
44
Genes as graphs
B0   5600   5/1                             >  D0 5/1
D0   6769   5/1   B0 5/1 >                  -> A0 5/1
A0   8755   5/1   D0 5/1 ->                 >  D1 5/1
D1   9013   5/1   A0 5/1 >                  -> A1 5/1
B1  10515   2/1                             >  D2 2/1
A1  10929   5/1   D1 5/1 ->                 >  D2 5/1
D2  11092   7/2   B1 2/1 >   A1 5/1 >       -> A2 7/2
A2  11673  10/1   D2 10/1 ->                >  E0 10/1
E0  12486  10/1   3P008703   A2 10/1 >
  • Nodes are connected by edges
  • Edges are exons or introns
  • Nodes are: B, initiation; D, splice donor;
    A, splice acceptor; E, polyadenylation site

[Figure: graph with nodes B0, D0, A0, D1, B1, A1, D2, A2, E0]
45
Reconstituting transcripts (3)
  • Generate graphs for each gene in a genomic
    segment
      • A gene is a fully connected graph
      • Graphs contain nodes and edges linking these
        nodes (exons or introns)
  • Generate the most likely transcripts for each
    gene
      • A transcript is a path through the graph
      • The decision on which paths to keep or reject is
        based on the frequencies with which edges are
        documented
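
The idea that a transcript is a path through the graph can be tried out on the example from the "Genes as graphs" slide. The Python sketch below encodes that graph by hand and enumerates every path from an initiation node (B) to a polyadenylation node (E). Reading B-to-D and A-to-D/E edges as exons and D-to-A edges as introns is an interpretation of the slide, the function names are illustrative, and the frequency-based filtering of paths is not implemented.

```python
# Node positions and edges transcribed from the example graph.
POSITION = {"B0": 5600, "D0": 6769, "A0": 8755, "D1": 9013, "B1": 10515,
            "A1": 10929, "D2": 11092, "A2": 11673, "E0": 12486}

EDGES = {
    "B0": [("D0", "exon")],
    "D0": [("A0", "intron")],
    "A0": [("D1", "exon")],
    "D1": [("A1", "intron")],
    "B1": [("D2", "exon")],
    "A1": [("D2", "exon")],
    "D2": [("A2", "intron")],
    "A2": [("E0", "exon")],
    "E0": [],
}


def enumerate_transcripts(edges):
    """All paths from a B node to an E node (assumes the graph has no cycles)."""
    paths = []

    def walk(node, path):
        path = path + [node]
        if node.startswith("E"):
            paths.append(path)
        for nxt, _kind in edges.get(node, []):
            walk(nxt, path)

    for start in sorted(n for n in edges if n.startswith("B")):
        walk(start, [])
    return paths


def exons_of(path, edges, position):
    """Genomic (start, end) of every exon edge along one path."""
    return [(position[a], position[b])
            for a, b in zip(path, path[1:])
            if dict(edges[a])[b] == "exon"]


transcripts = enumerate_transcripts(EDGES)
```

On this graph the enumeration yields two candidate transcripts, one starting at B0 with four exons and one starting at B1 with two; a real implementation would then weigh them by how often each edge is documented.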

46
Towards a gene index?
  • The graph nodes are sequence based and can easily
    be made non-redundant
  • The connectivity between nodes can be completed
    as more data become available
  • The gene index derived from graphs includes a
    full representation of the locus, including all
    splice variants and all of the primary transcript
    data (ESTs, full-length)

47
What are the uses of the graphs?
  • Solid experimental basis for designing microarray
    probes (3' ends)
  • Mapping of SAGE and MPSS tags
  • Exploration of the diversity of protein coding
    potential (alternative splicing)
  • Gene discovery based on expression patterns,
    either for whole locus or exon by exon
  • Exploration of the complexity of genes