Title: Advanced Bioinformatics
1 Gene prediction in ENCODE. Roderic Guigó i Serra, CRG-IMIM-UPF, Barcelona
- Advanced Bioinformatics,
- CSHL, October 2005
3- 1% of the genome: 44 regions
- target selection: committee to select sequence targets
- manual targets: regions with a lot of information
- random targets: stratified by non-exonic conservation with mouse and by gene density
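The random-target bullet describes a stratified draw. As a minimal sketch only, assuming candidate windows have already been labelled with a conservation level and a gene-density level (the window names, the level labels, the `stratified_pick` helper and the equal number of draws per stratum are all illustrative assumptions, not details given in the slide):

```python
import random
from collections import defaultdict

def stratified_pick(windows, per_stratum, seed=0):
    """Draw up to `per_stratum` windows from every (conservation, density) cell.

    windows: list of (name, conservation_level, density_level) tuples.
    Returns a dict mapping each stratum to the list of chosen window names.
    """
    rng = random.Random(seed)          # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for name, conservation, density in windows:
        strata[(conservation, density)].append(name)

    picks = {}
    for key in sorted(strata):
        names = sorted(strata[key])    # sort so the result does not depend on input order
        k = min(per_stratum, len(names))
        picks[key] = rng.sample(names, k)
    return picks

# Hypothetical candidate windows, for illustration only.
windows = [
    ("w1", "low", "low"), ("w2", "low", "low"), ("w3", "low", "low"),
    ("w4", "high", "low"),
]
print(stratified_pick(windows, per_stratum=2))
```

Sampling inside each cell, rather than across the whole genome, is what keeps sparsely populated strata (for example high conservation with low gene density) from being missed by chance.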
5DNase Hypersensitive Sites
DNA Replication
Epigenetic ?
Genes and Transcripts
Cis-regulatory elements (promoters, transcription
factor binding sites)
Long-range regulatory elements (enhancers,
repressors/silencers, insulators)
6gencode: encyclopedia of genes and gene variants
- identify all protein coding genes in the ENCODE regions
- identify one complete mRNA sequence for at least one splice isoform of each protein coding gene.
- eventually, identify a number of additional alternative splice forms.
- Roderic Guigó, IMIM-UPF-CRG
- Stylianos Antonarakis, Alexandre Reymond, Geneva
- Ewan Birney, EBI
- Michael Brent, WashU
- Lior Pachter, Berkeley
- Manolis Dermitzakis, Sanger
- Jennifer Ashurst, Tim Hubbard
7the gencode annotation pipeline
- manual curation: HAVANA (Sanger)
- experimental verification: Geneva
- bioinformatics: IMIM
8comparison with other gene sets
ALL EXONS
CODING EXONS
9from the ENCODE Chromatin and Replication Group,
John Stamatoyannopoulos
10one gene, many proteins: very complex
transcription units
11chimerism: tandem transcription / intergenic
splicing
12KUA and UEV, Thomson et al., Genome Research 2000
13systematic search for functional chimeras in ENCODE
- 165 tandem pairs in the same orientation
- 126 chimeric predictions obtained
- 96 tested, at least 4 positive
Parra et al., Genome Research in press
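The first step of such a search, listing tandem gene pairs in the same orientation, could look roughly like the sketch below. It is not the method of Parra et al.; the `Gene` record, the `tandem_pairs` helper and the rule used here (adjacent, non-overlapping genes on the same strand of the same chromosome) are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical minimal gene record; coordinates are 1-based and inclusive.
Gene = namedtuple("Gene", "name chrom strand start end")

def tandem_pairs(genes):
    """Return (left, right) names of adjacent, non-overlapping same-strand genes."""
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene.chrom, []).append(gene)

    pairs = []
    for chrom in sorted(by_chrom):
        ordered = sorted(by_chrom[chrom], key=lambda g: (g.start, g.end))
        # Compare each gene with its immediate right-hand neighbour.
        for left, right in zip(ordered, ordered[1:]):
            same_strand = left.strand == right.strand
            disjoint = left.end < right.start
            if same_strand and disjoint:
                pairs.append((left.name, right.name))
    return pairs

# Toy coordinates, for illustration only.
genes = [
    Gene("A", "chr1", "+", 100, 200),
    Gene("B", "chr1", "+", 300, 400),
    Gene("C", "chr1", "-", 500, 600),
    Gene("D", "chr1", "-", 700, 800),
]
print(tandem_pairs(genes))
```

Pairs are reported in genomic (left, right) order; for genes on the minus strand the transcriptionally upstream partner is the right-hand one.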
14EGASP05
- the complete annotation of 13 regions was released on January 30.
- the annotation of the remaining 31 regions was being obtained, and it was withheld.
- gene prediction groups were asked to submit predictions by April 15 in the remaining 31 regions.
- 18 groups participated, submitting 30 prediction sets
- predictions were compared to the annotations in an NHGRI-sponsored workshop at the Wellcome Trust Sanger Institute, on May 6 and 7.
17EGASP05
- two main goals
- to assess how well automatic methods are able to reproduce the (costly) manual/computational/experimental gencode annotation
- how complete is the gencode annotation? are there still genes consistently predicted by computational methods?
20accuracy measures
21accuracy at the exon level -- coding exons
- 18 groups participated, submitting 30 prediction sets
evidence-based, dual genome, ab initio
22accuracy at the exon level -- all exons
- 18 groups participated, submitting 30 prediction sets
evidence-based, dual genome, ab initio
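The slides do not spell out how exon-level accuracy is computed. A minimal sketch, assuming the conventional gene-prediction definitions (sensitivity is the fraction of annotated exons predicted with both boundaries exact; specificity is the fraction of predicted exons that exactly match an annotated exon), with `exon_accuracy` a hypothetical helper and toy coordinates:

```python
def exon_accuracy(annotated, predicted):
    """Exon-level sensitivity and specificity with exact boundary matching.

    Exons are (chrom, strand, start, end) tuples; duplicates are collapsed.
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = annotated & predicted    # exons with both boundaries exactly right
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Toy data: three of four annotated exons are recovered, five are predicted.
annotated = [("chr1", "+", 1, 10), ("chr1", "+", 20, 30),
             ("chr1", "+", 40, 50), ("chr1", "-", 60, 70)]
predicted = [("chr1", "+", 1, 10), ("chr1", "+", 20, 30),
             ("chr1", "+", 40, 55), ("chr1", "-", 60, 70),
             ("chr1", "+", 80, 90)]
print(exon_accuracy(annotated, predicted))
```

Restricting both input sets to coding exons, or using all exons, gives the two variants named in the slide titles.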
23programs are quite good at calling the protein
coding exons (accuracy at 80%; not as good at
calling the transcribed exons), but the best of
the programs predict correctly only 40% of the
complete CDS exonic structures, and in about 30%
of the cases they are able to predict correctly
none of the CDS exonic structures
24programs are quite good at calling the protein
coding exons (accuracy at 80%; not as good at
calling the transcribed exons), but the best of
the programs predict correctly only 40% of the
complete transcripts (considering only the coding
fraction), and in about 30% of the cases they are
able to predict correctly none of the CDS exonic
structures
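The two transcript-level figures can be read as (a) the fraction of annotated CDS exonic structures that are reproduced exactly and (b) the fraction of genes for which no annotated structure is reproduced. The sketch below computes both under that reading; treating "cases" as genes, the `transcript_accuracy` helper and the toy data are assumptions, not the evaluation code used in EGASP05.

```python
def transcript_accuracy(annotated_by_gene, predicted_structures):
    """Return (fraction of annotated structures matched, fraction of genes with no match).

    annotated_by_gene: dict mapping gene name to a list of structures, each
    structure being a list of (start, end) CDS exons.
    predicted_structures: list of structures in the same representation.
    """
    # Sort exons so a structure compares equal regardless of listing order.
    predicted = {tuple(sorted(s)) for s in predicted_structures}
    total = matched = genes_missed = 0
    for structures in annotated_by_gene.values():
        hits = [tuple(sorted(s)) in predicted for s in structures]
        total += len(hits)
        matched += sum(hits)
        if not any(hits):
            genes_missed += 1
    n_genes = len(annotated_by_gene)
    frac_matched = matched / total if total else 0.0
    frac_missed = genes_missed / n_genes if n_genes else 0.0
    return frac_matched, frac_missed

# Toy annotation: gene g1 has two isoforms, gene g2 has one.
annotated = {
    "g1": [[(1, 10), (20, 30)], [(1, 10), (40, 50)]],
    "g2": [[(100, 200)]],
}
predicted = [[(20, 30), (1, 10)], [(100, 210)]]
print(transcript_accuracy(annotated, predicted))
```

A single wrong boundary, as in the predicted `(100, 210)` exon, is enough for the whole structure to count as missed, which is why transcript-level figures fall well below exon-level ones.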
25the issue of completeness
26many novel exons predicted: we will prioritize a
few hundred for experimental verification using
RACE/RT-PCR, although our experiment in the 13
regions suggests that only a few of them are
likely to be real
27many computational predictions outside of the
annotation
In 13 ENCODE regions:
- 1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated
- 334 (27%) are outside annotations (could correspond to novel genes)
28many computational predictions outside of the
annotation
In 13 ENCODE regions:
- 1255 unique predicted introns (exon pairs) in one or more of the 9 UCSC gene prediction tracks are not annotated
- 334 (27%) are outside annotations (could correspond to novel genes)
- all tested by RT-PCR on 24 tissues
- 25 (2.0%) confirmed by RT-PCR in 24 tissues
- 16 (1.2%) with correctly predicted intron junctions
- 3 (0.2%) outside annotations (1 confirmation)
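How a list of predicted-but-unannotated introns might be assembled can be sketched as a set difference followed by an overlap test. The representation below (introns as `(chrom, strand, start, end)` tuples, gene extents as `(chrom, start, end)`), the `unannotated_introns` helper and the strand-agnostic definition of "outside annotations" are assumptions for illustration, not the procedure actually used.

```python
def unannotated_introns(predicted, annotated, gene_spans):
    """Return (novel, outside) sets of introns.

    novel: predicted introns absent from the annotation.
    outside: the subset of novel introns overlapping no annotated gene extent.
    """
    gene_spans = list(gene_spans)              # allow repeated scans
    novel = set(predicted) - set(annotated)    # unique unannotated introns

    def overlaps_gene(intron):
        chrom, _strand, start, end = intron
        # Closed intervals overlap when each one starts before the other ends.
        return any(c == chrom and start <= g_end and g_start <= end
                   for c, g_start, g_end in gene_spans)

    outside = {intron for intron in novel if not overlaps_gene(intron)}
    return novel, outside

# Toy data: one annotated intron, one novel intron inside a gene, one outside.
predicted = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400), ("chr1", "+", 5000, 5100)]
annotated = [("chr1", "+", 100, 200)]
gene_spans = [("chr1", 50, 1000)]
novel, outside = unannotated_introns(predicted, annotated, gene_spans)
print(len(novel), len(outside))
```

Pooling the introns from all prediction tracks into one set before calling the function is what makes the count "unique" across tracks.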
29Overview of the verification efforts II: AFFX-GenCode novel regions
- 40 intergenic transfrags from HL60 cell line that overlap GenCode gene predictions
- 20 overlapping gene predictions with no verification attempted by GenCode
- 20 overlapping gene predictions where verification by GenCode was negative
- 40 intergenic GenCode gene predictions that do not overlap HL60 transfrags
- 20 where no verification was attempted by GenCode
- 20 where verification by GenCode was negative
- (slide by Phil Kapranov, Affymetrix)
30Some preliminary stats on the 80 regions, 3' RACE only
- Gene predictions overlapping transfrags: total 39 (1/40 is a duplicated transfrag)
- 27 (69%) are positive in HL60 and 31 (80%) in HepG2 in the 3' RACE assays
- Gene predictions not overlapping transfrags: total 38 (2/40 are outside of the regions where we have probes on the ENCODE array)
- 18 (47%) are positive in HL60 and 25 (66%) in HepG2 in the 3' RACE assays
- (slide by Phil Kapranov, Affymetrix)
31 3' RACE based on a predicted exon
ENr131_egasp_224555_224677 identifies new major
and minor exons (shown by arrows) of a gene
BC042133 in HepG2 cell line only. Good
correspondence between RACE exons and GenScan
exons.
GenScan
HepG2 3' RACE Bottom strand
HepG2 3' RACE Top strand
32high-throughput genome-wide unbiased
transcription interrogation techniques
the ENCODE genes and transcripts group
- transfrags: Tom Gingeras (Affymetrix) and Mike Snyder (Yale)
- CAGE tags: Albin Sandelin, RIKEN
- ditags: Yijun Ruan, Genome Institute of Singapore
33Proteasome (prosome, macropain) 26S subunit,
non-ATPase, 4
(inhibits cholera-induced intestinal fluid
secretion)
Chrom 2
34protein coding genes are only a fraction of the
transcription detected in ENCODE
35transcription (apparently) not associated to
protein coding genes
TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME
COURSE (data by Tom Gingeras, Affymetrix)
36inferring novel protein coding genes from
transfrags
THREADING TRANSFRAGS into PROTEIN CODING GENES
51 http://genome.imim.es/gencode
HAVANA (Sanger): Jennifer Ashurst, Tim Hubbard, Adam Frankish, David Swarbreck, James Gilbert
AFFYMETRIX: Tom Gingeras, Sujit Dike, Phil Kapranov
EGASP05: Michael Ashburner, Vladimir Bajic, Suzanne Lewis, Martin Reese, Peter Good, Elise Feingold
ENCODE: France Denoeud (IMIM), Julien Lagarde, Josep F. Abril, Robert Castelo, Eduardo Eyras; Stylianos Antonarakis (Geneva), Alexandre Reymond, Catherine Ucla; Ewan Birney (EBI), Damian Keefe, Paul Flicek; Michael Brent (WashU); Lior Pachter (Berkeley); Manolis Dermitzakis (Sanger)