Title: Reference Genome Project
1. Reference Genome Project
ZFIN
3. Purpose
- Provide comprehensive annotation for 12 genomes
  - Arabidopsis thaliana
  - Caenorhabditis elegans
  - Danio rerio
  - Dictyostelium discoideum
  - Drosophila melanogaster
  - Escherichia coli
  - Gallus gallus
  - Homo sapiens
  - Mus musculus
  - Rattus norvegicus
  - Saccharomyces cerevisiae
  - Schizosaccharomyces pombe
- These organisms were selected because they
  - are established model organisms with published experimental data
  - have a genome database
  - have experienced GO curators
4. Complete genome annotation
- Breadth: every gene in the 12 genomes should be annotated
- Depth: every gene should be annotated to the highest level of knowledge
- The group has agreed that depth of annotation is best assessed by the curator annotating the gene.
  - If a gene has fewer than 5-10 papers, it makes sense to read and annotate all papers
  - If a gene has a lot of literature, the preferred strategy is to look at a recent review to make sure all important primary literature is captured; more recent papers should also be read
5. Metrics assessing breadth and depth of annotations
Need GFF3 files for all Ref Genomes!
- Breadth
  - Number of genes (protein coding and functional RNAs based on SO)
  - Number of genes with some functional annotation
  - Number of genes with functional annotation based on experiments using that organism
  - Number of genes with function inferred by sequence similarity
  - Number of genes with function inferred by electronic annotations
  - Number of genes for which there is no available information (root/ND annotations)
- Depth
  - Number of papers linked to a gene
  - Number of papers used to produce functional annotation
  - Number of papers read but for which no new annotations were produced
  - Ratio of deepest annotation to leaf node to measure granularity and use of the ontology (Suzi)
Depth needs to be assessed by curators!!
http://gocwiki.geneontology.org/index.php/Metrics_breath_and_depth_of_annotations
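The breadth counts listed on this slide could be computed roughly as in the following sketch. It assumes annotations are available as (gene, evidence code) pairs and that the gene list comes from something like a GFF3 file. The grouping of GO evidence codes into experimental, sequence-similarity, electronic, and ND categories is a simplification chosen for illustration, and the function name, the output keys, and the gene identifiers g1 to g5 are hypothetical.

```python
from collections import defaultdict

# Assumed grouping of GO evidence codes into the categories on the slide.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
SEQUENCE_SIMILARITY = {"ISS", "ISO", "ISA", "ISM"}
ELECTRONIC = {"IEA"}
NO_DATA = {"ND"}


def breadth_metrics(gene_ids, annotations):
    """Count genes per breadth category.

    gene_ids: iterable of gene identifiers (for example, parsed from GFF3).
    annotations: iterable of (gene_id, evidence_code) pairs.
    """
    genes = set(gene_ids)
    codes_by_gene = defaultdict(set)
    for gene_id, code in annotations:
        if gene_id in genes:
            codes_by_gene[gene_id].add(code)

    def genes_with(codes):
        # Number of genes carrying at least one evidence code from `codes`.
        return sum(1 for g in genes if codes_by_gene.get(g, set()) & codes)

    return {
        "genes": len(genes),
        # At least one annotation that is not an ND (root) annotation.
        "annotated": sum(
            1 for g in genes if codes_by_gene.get(g, set()) - NO_DATA
        ),
        "experimental": genes_with(EXPERIMENTAL),
        "sequence_similarity": genes_with(SEQUENCE_SIMILARITY),
        "electronic": genes_with(ELECTRONIC),
        # Genes whose only annotations are ND annotations.
        "no_information": sum(
            1 for g in genes
            if codes_by_gene.get(g) and codes_by_gene[g] <= NO_DATA
        ),
        # Genes with no annotation record at all (assumed separate category).
        "unannotated": sum(1 for g in genes if not codes_by_gene.get(g)),
    }


# Hypothetical toy data.
example_genes = ["g1", "g2", "g3", "g4", "g5"]
example_annotations = [
    ("g1", "IDA"), ("g1", "IEA"), ("g2", "ISS"), ("g3", "IEA"), ("g4", "ND"),
]
example_metrics = breadth_metrics(example_genes, example_annotations)
print(example_metrics)
```

A gene can fall into several categories at once (g1 above is both experimental and electronic), so the category counts need not sum to the number of annotated genes.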
6. Figures for Reference Genome genes (completely fictitious data)
Figures for Whole Genome genes (completely fictitious data)
From Ruth (Word doc sent earlier this week)
7. Mike Cherry
9. Measuring Information Content
Organism                  Distance to leaf  Information content  Coverage      Pubs per gene  Terms per gene
                          All    Ref        All     Ref          All    Ref    All    Ref     All    Ref
Arabidopsis thaliana      4.29   3.74       11.53   12.01        17.06  28.28  2.02   2.20    2.88   3.46
Caenorhabditis elegans    4.74   4.37       11.40   11.90        27.98  48.94  1.73   3.23    4.72   9.51
Danio rerio               4.35   4.01       11.01   11.66        19.34  33.56  1.65   2.36    6.12   8.10
Dictyostelium discoideum  4.40   3.46       11.37   12.77        25.84  41.56  1.89   2.46    4.44   6.41
Drosophila melanogaster   3.56   3.03       12.86   13.37        33.29  55.69  3.12   5.34    4.26   6.55
H. sapiens                3.65   3.65       12.96   12.87        30.80  73.71  6.01   6.01    4.65   6.82
Mus musculus              4.83   3.50       10.90   13.37        32.44  84.21  2.95   10.69   5.47   13.01
R. norvegicus             4.08   3.26       11.56   13.10        36.27  77.81  2.29   5.71    5.07   8.00
S. pombe                  3.82   3.05       11.94   12.96        44.15  59.54  3.13   4.01    6.91   17.55
Saccharomyces cerevisiae  3.22   2.63       12.88   14.08        37.42  62.84  3.04   5.67    4.65   6.82
Chris Mungall
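The slide does not say how information content was calculated. One common definition, assumed here, scores a term t as IC(t) = log2(N / n(t)), where n(t) is the number of annotations to t or any of its descendants and N is the count at the root. The sketch below applies that definition to a tiny hypothetical ontology; the term names, the `parents` mapping, and the counts are invented for illustration and are not the data behind the table.

```python
import math


def ancestors(term, parents):
    """Return all ancestors of `term` in a DAG given child -> parents lists."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagated_counts(direct_counts, parents):
    """Add each term's direct annotation count to the term and its ancestors."""
    totals = {term: 0 for term in parents}
    for term, count in direct_counts.items():
        for node in {term} | ancestors(term, parents):
            totals[node] = totals.get(node, 0) + count
    return totals


def information_content(direct_counts, parents, root):
    """IC(t) = log2(N / n(t)); terms with no annotations are skipped."""
    totals = propagated_counts(direct_counts, parents)
    n_root = totals[root]
    return {t: math.log2(n_root / n) for t, n in totals.items() if n > 0}


# Hypothetical three-term ontology: root <- binding <- dna_binding.
toy_parents = {"root": [], "binding": ["root"], "dna_binding": ["binding"]}
toy_direct = {"binding": 3, "dna_binding": 1}
toy_ic = information_content(toy_direct, toy_parents, "root")
print(toy_ic)
```

Under this definition, rarely used, specific terms score higher than broad ones, and the root always scores zero.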
10. Priorities: Selection of curation targets
- Genes that, when mutated, cause a disease
  - Not included: "upregulated in cancer x", "interacts with tumor suppressor y", and other weak evidence
- Disease gene lists
  - OMIM
  - RGD disease portal (first group: neurological diseases)
- Other lists
  - list of common genes between human, fly and zebrafish that were being used as a test case for PATO annotations; many were not in OMIM (revisit??)
- Current status: trying to focus on genes with the broadest interest; however, these often lack orthologs in yeast, E. coli, etc., so we need to balance these factors.
11. Orthologs
- Curators for each database are responsible for identifying orthologs of the disease gene
- Available tools
  - YOGY
  - InParanoid
  - OrthoMCL
  - TreeFam
  - Homologene
- Sequence analysis by curators
12. Software
- Google spreadsheet
  - shared by all curators
  - each database keeps track of putative orthologs
  - each database records the curation status for each gene
- Software requirements
  - Ensure consistent use of identifiers
  - Allow loading of MOD reports
  - Track that no ortholog was found
  - Provide reports to focus curation effort
  - Record that curation is 'comprehensive' as of a certain date
  - Allow a 1:many relation between human gene and MOD ortholog
  - Record orthology determination method
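Several of the software requirements above can be illustrated with a small data model. This is only a sketch under stated assumptions, not the project's actual tracking software: the class and method names are hypothetical, the 'PREFIX:local_id' identifier convention is assumed, and the identifiers in the example are placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


def check_identifier(identifier: str) -> str:
    """Enforce one assumed identifier convention: 'PREFIX:local_id'."""
    prefix, sep, local = identifier.partition(":")
    if not (prefix and sep and local):
        raise ValueError(f"expected 'PREFIX:local_id', got {identifier!r}")
    return identifier


@dataclass
class OrthologRecord:
    database: str                  # MOD that owns the record
    gene_id: Optional[str]         # None records that no ortholog was found
    method: Optional[str] = None   # orthology determination method
    comprehensive_as_of: Optional[date] = None  # set when curation is complete


@dataclass
class TargetGene:
    human_gene_id: str
    # One human gene may have many MOD orthologs (1:many).
    orthologs: List[OrthologRecord] = field(default_factory=list)

    def __post_init__(self):
        check_identifier(self.human_gene_id)

    def add_ortholog(self, database, gene_id, method):
        record = OrthologRecord(database, check_identifier(gene_id), method)
        self.orthologs.append(record)
        return record

    def mark_no_ortholog(self, database):
        record = OrthologRecord(database, None)
        self.orthologs.append(record)
        return record

    def pending(self):
        """Orthologs not yet marked comprehensive: a simple curation report."""
        return [
            r for r in self.orthologs
            if r.gene_id is not None and r.comprehensive_as_of is None
        ]


# Placeholder identifiers, for illustration only.
target = TargetGene("HGNC:0000")
first = target.add_ortholog("ZFIN", "ZFIN:example-a", "InParanoid")
target.add_ortholog("ZFIN", "ZFIN:example-b", "TreeFam")
target.mark_no_ortholog("SGD")
first.comprehensive_as_of = date(2007, 1, 1)
print([r.gene_id for r in target.pending()])
```

Storing "no ortholog found" as an explicit record, instead of simply leaving the database out, distinguishes "checked and absent" from "not yet checked", which is the point of that requirement.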
13. http://rails-dev.bioinformatics.northwestern.edu:24000/index.html
14. Annotation Progress
Organism   Genes with ortholog   Genes curated   Curated genes with publications
A. thaliana 32 99 32
C. elegans 65 46 99
D. discoideum 40 41 26
D. melanogaster 48 50 67
D. rerio 87 90 26
H. sapiens 44 98
M. musculus 98 84 94
R. norvegicus 96 100 55
S. cerevisiae 30 100 99
S. pombe 34 81 33
- Curation software will be able to generate that information
- We would like to display the list of selected genes, the list of identified orthologs, the curation status and a way to access annotations (graphs)
15. Annotation Consistency: Comparing annotations
17. Ontology development
Number of SourceForge requests in the "Reference Genome" group
18. Outreach: publicizing the reference genome effort
- Several suggestions
  - GO newsletter (already has the gene of the quarter): could add diseases
  - NCBI/OMIM could display/advertise genes with annotations
  - Take advantage of user requests that fit nicely in the initiative
  - Set up a reference genome wiki page showing which genes are coming up for annotation, which could also be used by researchers to suggest target genes
  - Make a page on the GO website that would include the disease genes we are curating and the gene of the quarter articles
  - Special display in AmiGO
  - Provide annotations in a separate file
  - Mark disease genes specifically in MODs
http://gocwiki.geneontology.org/index.php/Outreach_publicizing_the_project_and_developing_a_web_presence