ENCORE Evaluated NonCoding Region database

Slides: 19
Provided by: elias

Transcript and Presenter's Notes


1
ENCORE: Evaluated Non-Coding Region database
CODE
TransCODE
  • Elia Stupka - elia_at_tigem.it
  • Telethon Institute of Genetics and Medicine

2
My background
  • Part of the Ensembl team (EBI) annotating the
    human genome, Nature 2001
  • Managed the team in charge of the Fugu genome
    annotation in Singapore
  • Designed and developed BioPipe, software for
    large scale protocol-driven analysis
  • Now working on comparative genomics

3
Genome annotation: where do we stand?
  • Large efforts have gone into annotating genes
    and gene-related features
  • Enough known genes to spawn reliable gene
    prediction algorithms
  • Large collections of ESTs and cDNAs to refine
    gene structures
  • But what about the rest of the genome?

4
The advent of multiple genomes
  • Since the advent of the mouse genome and others
    there has been an explosion of papers on
    non-coding regions, following a common recipe:
      • detect orthologous genes
      • compare upstream and downstream sequences
      • test sequences in the laboratory
      • publish!
  • Some of the shortcomings:
      • there is no structured annotation of these
        regions
      • little clarity on the methods to be used to
        annotate different types of potential regions
      • strong focus on mammalian genomes

5
The non-coding world
  • There are several features that should be
    detected and annotated in the non-coding world:
      • Coding features! (i.e. bits and pieces that have
        been missed)
      • ncRNAs
      • Splice regulators
      • Promoter regulatory regions
      • Enhancers
      • etc.
  • Each of these features deserves its own analysis
    protocol and annotation!

6
Proposed work
  • We propose to fill the annotation gap:
      • Automate large-scale sequence comparisons of
        non-annotated regions
      • Focus on all chordate genomes, including Chicken,
        Fugu, Zebrafish, Ciona i., Ciona s.
      • Differentiate sequence comparison methodology
        based on evolutionary distance
      • Analyze detected regions for potential features
        (e.g. ncRNA, coding, regulatory) and initiate
        the appropriate analysis pipeline

7
Gene-centric approach
  • Most current large-scale analysis is
    genome-centric, analyzing chunks of sequence at
    a time
  • We will focus our analysis workflows on single
    genes of interest, or on single gene families
  • Benefits:
      • for families of particular interest it gives us
        the ability to curate and tailor the analysis
        (e.g. disease genes)
      • we will obtain results that are closer to the
        biologist's view of the data, with a complete
        analysis of their genes of interest, paralogs,
        orthologs, etc.
      • we can tell the biologist how closely a gene or
        gene family in a model organism genome resembles
        its counterparts in the well-studied genomes

8
Workflow design
[Workflow diagram. Input: gene of interest or gene family. Genomes: Human, Mouse, Rat, Chicken, Fugu, Ciona. Steps: synteny and phylogeny, orthologs and paralogs, alignments, conserved regions. Downstream analyses: coding analysis, ncRNA analysis, regulatory region analysis.]

9
Gene family analysis
  • Establish ortholog and paralog relationships in
    all analyzed genomes, integrating:
      • Phylogenetic analysis
      • Synteny analysis
      • Manual intervention on curated gene families
  • Annotate genes for conservation of features
    across genomes
      • e.g. show conserved and missing protein domains
        in each genome
      • Immediate view of conservation of function for
        the biologist
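
The "conserved and missing protein domains in each genome" view could be sketched as below. This is a minimal illustration, not the project's code: the function name, the domain names and the per-genome annotations are all made-up assumptions.

```python
def domain_conservation(reference, orthologs):
    """For each genome, split the reference domains into conserved and missing."""
    report = {}
    for genome, domains in orthologs.items():
        found = set(domains)
        report[genome] = {
            "conserved": [d for d in reference if d in found],
            "missing": [d for d in reference if d not in found],
        }
    return report


# Hypothetical domain annotations for one gene family (illustrative data only).
reference_domains = ["SAM", "DNA_binding", "oligomerization"]
ortholog_domains = {
    "mouse": ["SAM", "DNA_binding", "oligomerization"],
    "fugu": ["DNA_binding", "oligomerization"],
    "ciona": ["DNA_binding"],
}

report = domain_conservation(reference_domains, ortholog_domains)
for genome, result in report.items():
    print(genome, result)
```

A table like this, one row per genome, gives the at-a-glance view of functional conservation the slide describes.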

Map Tree to Genome!
10
Alignments
  • Analyze sequence surrounding orthologs
      • Using 1 Mb on each side of the gene to allow for
        distant enhancers
  • Perform global alignments
      • MLAGAN alignment, particularly useful in mammalian
        genomes and probably chicken
      • Detect conserved regions using varying thresholds
        according to the species compared
  • Analyze detected regions further using
    promoterwise in more distant genomes
      • Allows flipping and rearrangement of short
        regions, likely to happen when comparing human to
        fish
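
Detecting conserved regions with species-dependent thresholds could look roughly like the sketch below. It assumes a gapped pairwise alignment is already available (e.g. extracted from a global alignment). The threshold values, window size and function names are illustrative assumptions, not figures from the presentation.

```python
# Hypothetical identity thresholds per compared species (illustrative values only).
THRESHOLDS = {"mouse": 0.9, "chicken": 0.75, "fugu": 0.6}


def window_identity(seq_a, seq_b, window):
    """Yield (start, identity) for every window of a gapped pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        # Gap columns never count as matches.
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        yield start, matches / window


def conserved_windows(seq_a, seq_b, species, window=8):
    """Return window start positions whose identity meets the species threshold."""
    threshold = THRESHOLDS[species]
    return [
        start
        for start, identity in window_identity(seq_a, seq_b, window)
        if identity >= threshold
    ]


human = "ACGTACGTACGTTTTT"
other = "ACGTACGTACCAAAAA"
print(conserved_windows(human, other, "mouse"))
print(conserved_windows(human, other, "fugu"))
```

The same alignment yields more conserved windows under the looser "fugu" threshold than under the stricter "mouse" one, which is the point of varying the threshold with evolutionary distance.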

11
Why promoterwise?
[Figure: VISTA plot of the human vs. mouse p63 region]

12
Analysis of conserved regions
  • Score each conserved region
      • Base the score on conservation across genomes,
        while allowing for the region being missed
        altogether in some genomes
  • Analyze for potential features such as:
      • ncRNAs (gaps in alignments)
      • Coding regions (third base mutation bias)
      • Regulatory regions (clustering of mutations)
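
Two of the ideas above can be sketched in a few lines: a score that tolerates genomes where the region was not found, and a third-base bias check for coding potential. This is a simplified illustration under stated assumptions: the score is a plain mean over genomes with a hit, and the bias check takes the most mutated codon phase as the candidate third base. Neither is claimed to be the method used in the project.

```python
def conservation_score(identities):
    """Mean identity over genomes where the region was found (None = not found)."""
    found = [value for value in identities.values() if value is not None]
    return sum(found) / len(found) if found else 0.0


def substitution_phase_counts(seq_a, seq_b):
    """Count substitutions at each alignment position modulo 3, ignoring gaps."""
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b and a != "-" and b != "-":
            counts[i % 3] += 1
    return counts


def third_base_bias(seq_a, seq_b):
    """Fraction of substitutions that fall in the most frequently mutated phase."""
    counts = substitution_phase_counts(seq_a, seq_b)
    total = sum(counts)
    return max(counts) / total if total else 0.0


# Made-up aligned sequences where every substitution hits the third codon base.
seq_a = "ATGGCCAAAGGTCTG"
seq_b = "ATGGCTAAGGGACTC"
print(substitution_phase_counts(seq_a, seq_b))
print(third_base_bias(seq_a, seq_b))
print(conservation_score({"mouse": 0.9, "chicken": 0.7, "fugu": None}))
```

A bias near 1/3 means substitutions are spread evenly across phases, as expected for non-coding sequence. A value close to 1 hints at a coding region.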

13
ncRNA annotation
  • MFOLD structure analysis
  • QRNA and ddbRNA analysis
  • Compare to Rfam to assign to known RNA families
  • Check for potential antisense activity on
    surrounding genes
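
One simple reading of the antisense check is positional: flag candidate ncRNAs that overlap an annotated gene on the opposite strand. The sketch below makes that assumption. The `Feature` type, the coordinates and the gene names are hypothetical, and a sequence-complementarity search would be a different and complementary test.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 or -1


def antisense_targets(candidate, genes):
    """Return names of genes overlapping the candidate on the opposite strand."""
    return [
        gene.name
        for gene in genes
        if gene.strand == -candidate.strand
        and gene.start <= candidate.end
        and candidate.start <= gene.end
    ]


# Illustrative coordinates only.
candidate = Feature("candidate_ncRNA", 500, 650, -1)
genes = [
    Feature("geneA", 100, 600, +1),   # overlaps, opposite strand
    Feature("geneB", 620, 900, -1),   # overlaps, same strand
    Feature("geneC", 700, 1000, +1),  # opposite strand, no overlap
]
print(antisense_targets(candidate, genes))
```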

14
Coding region annotation
  • If identified as coding, determine whether it is:
      • A portion (exon or part of an exon) of a
        mis-annotated surrounding gene structure
      • A short single-exon gene (should have a viable
        promoter, polyA tail, etc.)
      • A recent pseudogene, a relic of evolution

15
Regulatory region annotation
  • Perform a TRANSFAC search to identify known TFBS
  • Analyze all vs. all identified regions to build a
    vocabulary of novel potential regulatory regions
  • Build free grammar models of regions conserved
    and over-represented in the genome
  • For specific gene families, characterize regions
    in vivo and use the results to refine models of
    functional regions
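
As a rough illustration of the all-vs-all vocabulary idea, the sketch below collects k-mers that occur in several distinct conserved regions. It swaps in exact k-mer matching for whatever comparison the project actually uses and ignores reverse complements. The function name, the value of `k` and the sequences are assumptions.

```python
from collections import Counter


def shared_kmers(regions, k=8, min_regions=2):
    """Return k-mers present in at least min_regions distinct regions."""
    presence = Counter()
    for seq in regions:
        # A set, so each k-mer counts once per region regardless of repeats.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        presence.update(kmers)
    return {kmer: n for kmer, n in presence.items() if n >= min_regions}


# Made-up conserved regions; two of them share the word TGACGTCA.
regions = ["TTGACGTCAAA", "CCTGACGTCAGG", "ACACACACACAC"]
print(shared_kmers(regions))
```

Counting presence per region rather than total occurrences keeps low-complexity repeats inside a single region from dominating the vocabulary.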

16
CODE
  • Telethon-funded project to perform comparative
    analysis on selected disease gene families
  • Integrate the work described with manual curation of
    each family
      • Keep a curated database of features important for
        the disease, e.g. alternative splice variants,
        mutation sites, etc.
      • Identify those features automatically in the
        genomes analyzed
  • Tight collaboration with TIGEM groups
      • Verifying our in silico results in vivo
      • Add experimental info to the database, such as
        in situ results, subcellular localization
        results, yeast two-hybrid assays, etc.

17
TransCODE
  • FP6-funded (hopefully!) project on
    characterization of transcription factors
  • Integrate in vitro SELEX-SAGE assays
      • Run all Ciona TFs (600) and some vertebrate TFs
        on a column, wash with all degenerate x-mers
      • Sequence all x-mers that show binding affinity
        to the TF
      • Build refined models of the TF binding site
      • Compare models between Ciona and vertebrate
        (hypothesis being that they are very similar!)
  • Test identified regions in mouse, zebrafish
    (transient expression assay), Ciona
  • In Ciona and zebrafish verify direct action of a
    TF by co-injecting a reporter construct with a
    morpholino against the hypothesised TF
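
Building a binding-site model from the sequenced x-mers could be sketched as a log-odds position weight matrix. The sketch assumes the bound sequences are already aligned and of equal length, and it uses a uniform background with a pseudocount of 1. Real SELEX output would first need motif discovery, and the sequences below are made up.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds of seq under the matrix."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Hypothetical x-mers recovered for one TF (illustrative data only).
sites = ["TGACGT", "TGACGT", "TGACGA", "TGATGT"]
pwm = build_pwm(sites)
consensus = "".join(max(column, key=column.get) for column in pwm)
print(consensus)
print(score(pwm, "TGACGT") > score(pwm, "AAAAAA"))
```

Matrices built this way for a Ciona TF and its vertebrate counterpart could then be compared column by column to test the hypothesis that the models are very similar.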

18
Acknowledgments
  • Telethon CODE and CORE grants!
  • Bioinformatics group at TIGEM
      • Pedro Cruz (ncRNA analysis)
      • Vincenza Maselli (gene family analysis)
      • Remo Sanges (Alignment pipelines)
  • IT core
      • Giampiero Lago (Webmaster)
      • Marco Savarese (Helpdesk)
      • Mario Traditi (System Administrator)