Advanced Bioinformatics, - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Advanced Bioinformatics,


1

gene prediction in ENCODE
Roderic Guigó i Serra, CRG-IMIM-UPF, Barcelona
  • Advanced Bioinformatics,
  • CSHL, October 2005

2
(No Transcript)
3
  • 1% of the genome, 44 regions
  • target selection: a committee selected the sequence
    targets
  • manual targets: regions with a lot of existing information
  • random targets: stratified by non-exonic
    conservation with mouse and by gene density

4
(No Transcript)
5
DNase Hypersensitive Sites
DNA Replication
Epigenetic ?
Genes and Transcripts
Cis-regulatory elements (promoters, transcription
factor binding sites)
Long-range regulatory elements (enhancers,
repressors/silencers, insulators)
6
GENCODE: encyclopedia of genes and gene variants
  • identify all protein-coding genes in the ENCODE
    regions
  • identify one complete mRNA sequence for at least
    one splice isoform of each protein-coding gene
  • eventually, identify a number of additional
    alternative splice forms
  • Roderic Guigó, IMIM-UPF-CRG
  • Stylianos Antonarakis and Alexandre Reymond, Geneva
  • Ewan Birney, EBI
  • Michael Brent, WashU
  • Lior Pachter, Berkeley
  • Manolis Dermitzakis, Sanger
  • Jennifer Ashurst and Tim Hubbard, HAVANA (Sanger)

7
the GENCODE annotation pipeline
manual curation: HAVANA (Sanger)
experimental verification: Geneva
bioinformatics: IMIM
8
comparison with other gene sets
ALL EXONS
CODING EXONS
9
from the ENCODE Chromatin and Replication Group,
John Stamatoyannopoulos
10
one gene, many proteins
very complex transcription units
11
chimerism: tandem transcription / intergenic
splicing
12
KUA and UEV, Thomson et al., Genome Research 2000
13
systematic search for functional chimeras in ENCODE
  • 165 tandem pairs in the same orientation
  • 126 chimeric predictions obtained
  • 96 tested, at least 4 positive
Parra et al., Genome Research, in press
14
EGASP05
  • the complete annotation of 13 regions was
    released on January 30.
  • the annotation of the remaining 31 regions was
    still in progress and was withheld.
  • gene prediction groups were asked to submit
    predictions for the remaining 31 regions by
    April 15.
  • 18 groups participated, submitting 30 prediction
    sets
  • predictions were compared to the annotations in an
    NHGRI-sponsored workshop at the Wellcome Trust
    Sanger Institute, on May 6 and 7.

15
(No Transcript)
16
(No Transcript)
17
EGASP05
  • two main goals:
  • to assess how well automatic methods are able to
    reproduce the (costly) manual/computational/experimental
    GENCODE annotation
  • to assess how complete the GENCODE annotation is: are
    there still genes consistently predicted by
    computational methods?

18
(No Transcript)
19
(No Transcript)
20
accuracy measures
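
The transcript does not say which measures this slide lists. As a minimal sketch, assuming the exon-level sensitivity and specificity conventionally used in gene prediction evaluation, where a predicted exon counts as correct only if both of its boundaries match an annotated exon exactly, the calculation might look like this. The function name, the tuple layout and the toy coordinates are all hypothetical. Note that "specificity" in this convention is TP / (TP + FP), which other fields call precision.

```python
def exon_level_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (chrom, strand, start, end) tuples. A predicted exon is a
    true positive only if an identical tuple exists in the annotation.
    """
    pred, annot = set(predicted), set(annotated)
    true_pos = len(pred & annot)
    # Sensitivity: fraction of annotated exons that were predicted exactly.
    sensitivity = true_pos / len(annot) if annot else 0.0
    # Specificity (gene prediction usage): fraction of predictions that are correct.
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy data (hypothetical coordinates, not taken from ENCODE).
annotated = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400), ("chr1", "+", 500, 650)]
predicted = [
    ("chr1", "+", 100, 200),  # exact match
    ("chr1", "+", 300, 410),  # wrong acceptor/donor boundary: counted as incorrect
    ("chr1", "+", 500, 650),  # exact match
    ("chr1", "+", 800, 900),  # not in the annotation
]

sn, sp = exon_level_accuracy(predicted, annotated)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

The same set-based comparison could be applied to whole transcripts by using the tuple of all CDS exons of a transcript as the unit instead of a single exon.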
21
accuracy at the exon level: coding exons
  • 18 groups participated, submitting 30 prediction
    sets

evidence-based / dual-genome / ab initio
22
accuracy at the exon level: all exons
  • 18 groups participated, submitting 30 prediction
    sets

evidence-based / dual-genome / ab initio
23
programs are quite good at calling the protein-coding
exons (accuracy at 80%; not as good at calling the
transcribed exons), but the best of the programs
predict correctly only 40% of the complete CDS exonic
structures, and in about 30% of the cases they are
able to predict correctly none of the CDS exonic
structures
24
programs are quite good at calling the protein-coding
exons (accuracy at 80%; not as good at calling the
transcribed exons), but the best of the programs
predict correctly only 40% of the complete transcripts
(considering only the coding fraction); in about 30%
of the cases they are able to predict correctly none
of the CDS exonic structures
25
the issue of completeness
26
many novel exons predicted
we will prioritize a few hundred for experimental
verification using RACE and RT-PCR, although our
experiment in the 13 regions suggests that only a
few of them are likely to be real
27
many computational predictions outside of the
annotation
In 13 ENCODE regions:
  • 1255 unique predicted introns (exon pairs) in one
    or more of the 9 UCSC gene prediction tracks are
    not annotated
  • 334 (27%) are outside annotations (could
    correspond to novel genes)
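
The slide does not describe how predicted introns were compared with the annotation. A minimal sketch of one way to do such a comparison follows, assuming an intron is "annotated" only when its exact coordinates appear in the annotation, and is "outside annotations" when it overlaps no annotated gene. The function name, the tuple layouts and the coordinates are hypothetical.

```python
def classify_introns(predicted_introns, annotated_introns, annotated_genes):
    """Sort unique predicted introns into three classes.

    Introns are (chrom, strand, start, end); genes are (chrom, start, end).
    Assumption: 'outside' means the intron overlaps no annotated gene.
    """
    annotated = set(annotated_introns)
    classes = {"annotated": [], "novel_in_gene": [], "outside": []}
    for intron in sorted(set(predicted_introns)):  # set() keeps unique introns only
        chrom, _strand, start, end = intron
        overlaps_gene = any(
            g_chrom == chrom and start <= g_end and end >= g_start
            for g_chrom, g_start, g_end in annotated_genes
        )
        if intron in annotated:
            classes["annotated"].append(intron)
        elif overlaps_gene:
            classes["novel_in_gene"].append(intron)
        else:
            classes["outside"].append(intron)
    return classes


# Toy data (hypothetical coordinates).
genes = [("chr1", 1000, 5000)]
known = [("chr1", "+", 1200, 1800)]
predicted = [
    ("chr1", "+", 1200, 1800),  # matches an annotated intron
    ("chr1", "+", 2000, 2600),  # unannotated, inside an annotated gene
    ("chr1", "+", 7000, 7500),  # unannotated, outside any annotated gene
    ("chr1", "+", 7000, 7500),  # same intron from a second track: counted once
]

result = classify_introns(predicted, known, genes)
for label, introns in result.items():
    print(label, len(introns))
```

Pooling introns from several prediction tracks into one set before classification mirrors the "unique predicted introns in one or more tracks" wording of the slide.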
28
many computational predictions outside of the
annotation
In 13 ENCODE regions:
  • 1255 unique predicted introns (exon pairs) in one
    or more of the 9 UCSC gene prediction tracks are
    not annotated
  • 334 (27%) are outside annotations (could
    correspond to novel genes)
  • all tested by RT-PCR on 24 tissues
  • 25 (2.0%) confirmed by RT-PCR in 24 tissues
  • 16 (1.2%) with correctly predicted intron junctions
  • 3 (0.2%) outside annotations (1 confirmation)
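
The fractions on this slide can be recomputed from the counts. The short sketch below assumes every figure in parentheses is a percentage of the 1255 unannotated predicted introns; that denominator is an inference from the numbers, not something the slide states, and the dictionary labels are shorthand introduced here.

```python
# Counts quoted on the slide; the shared denominator is an assumption.
total_unannotated = 1255
counts = {
    "outside annotations": 334,
    "confirmed by RT-PCR": 25,
    "correct intron junctions": 16,
    "confirmed, outside annotations": 3,
}

# Percentage of the 1255 unannotated predicted introns in each category.
percentages = {
    label: 100 * count / total_unannotated for label, count in counts.items()
}

for label, pct in percentages.items():
    print(f"{label}: {counts[label]}/{total_unannotated} = {pct:.2f}%")
```

Printing two decimals makes it easy to compare the computed values with the rounded figures on the slide.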
29
Overview of the verification efforts II:
AFFX-GenCode novel regions
  • 40 intergenic transfrags from the HL60 cell line that
    overlap GenCode gene predictions
      • 20 overlapping gene predictions with no
        verification attempted by GenCode
      • 20 overlapping gene predictions where
        verification by GenCode was negative
  • 40 intergenic GenCode gene predictions that do
    not overlap HL60 transfrags
      • 20 where no verification was attempted by GenCode
      • 20 where verification by GenCode was negative
  • (slide by Phil Kapranov, Affymetrix)

30
Some preliminary stats on the 80 regions: 3' RACE
only
  • Gene predictions overlapping transfrags: total 39
    (1/40 is a duplicated transfrag)
  • 27 (69%) are positive in HL60 and 31 (80%) in
    HepG2 in the 3' RACE assays
  • Gene predictions not overlapping transfrags:
    total 38 (2/40 are outside of the regions where
    we have probes on the ENCODE array)
  • 18 (47%) are positive in HL60 and 25 (66%) in
    HepG2 in the 3' RACE assays
  • (slide by Phil Kapranov, Affymetrix)

31
3' RACE based on a predicted exon,
ENr131_egasp_224555_224677, identifies new major
and minor exons (shown by arrows) of a gene,
BC042133, in the HepG2 cell line only. Good
correspondence between RACE exons and GenScan
exons.
GenScan
HepG2 3' RACE, bottom strand
HepG2 3' RACE, top strand
32
high-throughput, genome-wide, unbiased
transcription interrogation techniques
the ENCODE genes and transcripts group
  • transfrags: Tom Gingeras (Affymetrix) and
    Mike Snyder (Yale)
  • CAGE tags: Albin Sandelin, RIKEN
  • ditags: Yijun Ruan, Genome Institute of
    Singapore
33
Proteasome (prosome, macropain) 26S subunit,
non-ATPase, 4
(inhibits cholera-induced intestinal fluid
secretion)
Chrom 2
34
protein coding genes are only a fraction of the
transcription detected in ENCODE
35
transcription (apparently) not associated with
protein-coding genes
TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME
COURSE (data by Tom Gingeras, Affymetrix)
36
inferring novel protein coding genes from
transfrags
THREADING TRANSFRAGS into PROTEIN CODING GENES
37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
(No Transcript)
48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
http://genome.imim.es/gencode
HAVANA (Sanger): Jennifer Ashurst, Tim Hubbard, Adam
Frankish, David Swarbreck, James Gilbert
AFFYMETRIX: Tom Gingeras, Sujit Dike, Phil Kapranov
EGASP05: Michael Ashburner, Vladimir Bajic, Suzanne
Lewis, Martin Reese, Peter Good, Elise Feingold
ENCODE: France Denoeud (IMIM), Julien Lagarde, Josep
F. Abril, Robert Castelo, Eduardo Eyras; Stylianos
Antonarakis (Geneva), Alexandre Reymond, Catherine
Ucla; Ewan Birney (EBI), Damian Keefe, Paul
Flicek; Michael Brent (WashU); Lior Pachter
(Berkeley); Manolis Dermitzakis (Sanger)