Ontologies and Biomedicine - PowerPoint PPT Presentation

Slides: 39
Provided by: suzann49

1
Ontologies and Biomedicine
  • What is the "right" amount of semantics?

2
Ontologies and Biomedicine
  • The right amount of semantics depends on what
    you want to do with it

3
Ontologies and Biomedicine
  • Research is based on inference from what is
    known, and therefore it demands rigor

4
Ontologies and Biomedicine
  • Without rigor, we won't know what we know, or
    where to find it, or what to infer from it.

5
Semantic Spectrum
  • Natural language: ambiguous, highly expressive
  • Computable ontology: logical and precise, less
    expressive
6
Ad hoc tagging approach
  • Let the users define the words and phrases
  • Foregoes the use of an expertly curated
    vocabulary or ontology.
  • Fast and distributed approach yields a vast
    amount of content
  • No recruitment and training of people to maintain
    the ontology is required.
  • No recruitment and training of annotators to
    interpret the material is required.

8
Ad hoc tagging approach
  • Tagging approach places the burden of
    interpretation and classification on every end
    user
  • Overall this is more costly and wasteful
  • Is inappropriate in the scientific domain
  • The problem is not about people communicating. It
    is about computers and HCI.

9
Build, apply, and use
  • Ontology captures current scientific theory that
    seeks to explain all of the existing evidence and
    is used to draw inferences and make predictions
  • Acts like a review
  • Requires curators who are experts in both the
    science and logic
  • Ontology application is the real bottleneck
  • But overall is less costly and wasteful

10
  • Univocity
  • Terms should have the same meanings on every
    occasion of use
  • Positivity
  • Terms such as non-mammal or non-membrane do
    not designate genuine classes.
  • Objectivity
  • Terms such as unknown or unclassified or
    unlocalized do not designate biological natural
    kinds.
  • Single Inheritance
  • No class in a classification hierarchy should
    have more than one is_a parent on the immediate
    higher level
  • Intelligible Definitions
  • The terms used in a definition should be simpler
    (more intelligible) than the term to be defined
  • Reality Based
  • When building or maintaining an ontology, always
    think carefully about how classes relate to
    instances in reality
  • Distinguish Classes and Instances
  • What is necessarily true for instances is not
    necessarily true for classes
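
Several of these principles are mechanical enough to check automatically. The following is a minimal Python sketch, assuming a hypothetical toy ontology stored as a dict from each term to its list of is_a parents. The term assignments and the `check_principles` helper are invented for illustration and are not part of any GO tooling.

```python
import re

# Hypothetical toy ontology: term -> list of is_a parents (illustrative only,
# with three deliberate violations).
TOY_ONTOLOGY = {
    "cell": [],
    "stem cell": ["cell"],
    "neuroblast": ["stem cell"],
    "non-mammal": ["cell"],             # violates positivity
    "unclassified": ["cell"],           # violates objectivity
    "hemocyte": ["cell", "stem cell"],  # violates single inheritance
}

NEGATIVE_PREFIX = re.compile(r"^non-", re.IGNORECASE)
PLACEHOLDER_WORDS = {"unknown", "unclassified", "unlocalized"}


def check_principles(ontology):
    """Return a sorted list of (term, violated principle) pairs."""
    problems = []
    for term, parents in ontology.items():
        # Positivity: negated terms do not designate genuine classes.
        if NEGATIVE_PREFIX.match(term):
            problems.append((term, "positivity"))
        # Objectivity: placeholder words do not designate natural kinds.
        if any(word in PLACEHOLDER_WORDS for word in term.lower().split()):
            problems.append((term, "objectivity"))
        # Single inheritance: at most one is_a parent.
        if len(parents) > 1:
            problems.append((term, "single inheritance"))
    return sorted(problems)


for term, principle in check_principles(TOY_ONTOLOGY):
    print(f"{term}: violates {principle}")
```

Univocity, intelligible definitions, and the class/instance distinction need human judgment and are not covered by a check like this.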

11
Annotation bottleneck
  • An active lab can easily generate 10-100GB of
    data per month, and it is very difficult to
    manage data on this scale.
  • Even the best analytic schemes will be for naught
    if we cannot find our data.
  • And the data is complex
  • Yet, the annotation effort required will be
    utterly wasted if it cannot be reliably computed
    upon.

13
Implies numerous light ontologies
  • 3-dimensions
  • Protein function
  • Cell type
  • Tissue
  • Stage
  • Cellular component
  • Organism
  • And more

14
Or it implies a single complex one
  • 3-dimensions
  • Protein function
  • Cell type
  • Tissue
  • Stage
  • Cellular anatomy
  • Organism
  • And more
  • Plus all of the relations between these elements

15
Practicalities
  • The ontology should be robust, or the annotators'
    time is wasted
  • Research won't wait; data must be annotated at
    the rate at which it is generated
  • Complex ontologies are much more difficult to get
    right than lighter ones
  • Light ontologies are easier to build and maintain
  • Complex ontologies can be built from lighter ones
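
One way to picture building a complex term from lighter ontologies is as a combination of one term per axis, with each axis validated independently. The sketch below is only an illustration under stated assumptions: the three flat vocabularies, the `ComposedTerm` class, and the `compose` helper are hypothetical and do not reflect how GO itself is built.

```python
from dataclasses import dataclass

# Hypothetical light ontologies, one flat vocabulary per axis (illustrative only).
LIGHT_ONTOLOGIES = {
    "cell_type": {"hemocyte", "neuroblast", "erythrocyte"},
    "process": {"differentiation", "proliferation", "migration"},
    "organism": {"Drosophila", "Mus", "Danio"},
}


@dataclass(frozen=True)
class ComposedTerm:
    """A complex term expressed as one term from each light ontology."""
    cell_type: str
    process: str
    organism: str

    def label(self):
        return f"{self.cell_type} {self.process} ({self.organism})"


def compose(cell_type, process, organism):
    """Build a composed term, rejecting values missing from any light ontology."""
    values = {"cell_type": cell_type, "process": process, "organism": organism}
    for axis, value in values.items():
        if value not in LIGHT_ONTOLOGIES[axis]:
            raise ValueError(f"{value!r} is not a term in the {axis} ontology")
    return ComposedTerm(**values)


print(compose("neuroblast", "proliferation", "Drosophila").label())
```

Because each axis is checked against its own vocabulary, a misspelled or unknown term is rejected at annotation time rather than silently creating a new complex term.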

16
A successful case study
  • Gene Ontology

17
The aims of GO
  • To develop comprehensive shared vocabularies of
    terms describing aspects of molecular biology.
  • To describe the gene products held in each
    contributing model organism database.
  • To provide a scientific resource for access to
    the vocabularies, the annotations, and associated
    data.
  • To provide a software resource to assist in
    curation of GO term assignments to biological
    objects.

18
The primary strength of the GO
  • The GO covers three domains of biology
  • Molecular Function
  • Biological Process
  • Cellular Component
  • These are precisely defined axes of
    classification

19
The breakdown of work
  • Task 1
  • Building the ontology: a computable description
    of the biological world
  • Task 2
  • Describing your gene product: annotation
  • Biological process
  • Molecular function
  • Cellular localization

20
The early key decisions
  • The vocabulary itself requires a serious and
    ongoing effort.
  • Carefully define every concept
  • Initially keep things as simple as possible and
    only use a minimally sufficient data
    representation.
  • Focus initially on molecular aspects that are
    shared between many organisms.

21
GO databases: distributed and centralized
  • Support cross-database queries
  • By having a mutual understanding of the
    definition and meaning of any word used to
    describe a gene product
  • Provide database access to a common repository of
    annotations
  • By submitting a summary of gene products that
    have been annotated
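
The idea of a common repository can be sketched as an index keyed by shared GO identifiers. This is a minimal Python illustration, not the actual GO database design: the database names, gene product identifiers, and the `build_repository` and `query` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical annotation summaries submitted by two contributing databases.
# Gene product identifiers are invented for the example.
DB_A = [("geneA1", "GO:0006810"), ("geneA2", "GO:0005524")]
DB_B = [("geneB1", "GO:0006810"), ("geneB2", "GO:0003677")]


def build_repository(**databases):
    """Merge per-database (gene product, GO ID) pairs into one index keyed by GO ID."""
    repo = defaultdict(list)
    for db_name, annotations in databases.items():
        for gene_product, go_id in annotations:
            repo[go_id].append((db_name, gene_product))
    return dict(repo)


def query(repo, go_id):
    """Return every (database, gene product) pair annotated with the given GO ID."""
    return repo.get(go_id, [])


repo = build_repository(db_a=DB_A, db_b=DB_B)
print(query(repo, "GO:0006810"))
```

The cross-database query works only because both databases use the same identifier for the same concept, which is the "mutual understanding" the slide refers to.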

24
GODatabase.org
  • Hits 77,012
  • Visits 14,063
  • Sites 6,638
  • Averages per week

26
Number of links to a site as reported by Google
  • www.geneontology.org: 7,240
  • www.godatabase.org: 33
  • obo.sourceforge.net: 10
  • song.sourceforge.net: 6
  • genome.ucsc.edu: 3,670
  • www.ncbi.nih.gov: 12,000
  • www.ebi.ac.uk: 14,900
  • sciencemag.org: 14,900
  • www.ncbi.nlm.nih.gov: 34,500
27
Most Common GOIDs accessed via AmiGO
  • 72020 GO:0006810 transport
  • 56862 GO:0005524 ATP binding
  • 53622 GO:0019012 virion
  • 47773 GO:0006955 immune response
  • 46943 GO:0003677 DNA binding
  • 41474 GO:0006508 proteolysis and peptidolysis
  • 41126 GO:0006355 regulation of transcription,
    DNA-dependent
  • 40427 GO:0004872 receptor activity
  • 34943 GO:0005215 transporter activity
  • 30890 GO:0007186 G-protein coupled receptor
    protein signaling pathway
  • 30001 GO:0003700 transcription factor activity
  • 28127 GO:0006118 electron transport
  • 26636 GO:0005509 calcium ion binding
  • 24007 GO:0006968 cellular defense response
  • 21250 GO:0016486 peptide hormone processing
  • 20440 GO:0008152 metabolism
  • 19742 GO:0005515 protein binding
  • 19316 GO:0007155 cell adhesion
  • 18254 GO:0005198 structural molecule activity
28
Taxa covered by the GO (some)
  • Arabidopsis: TAIR, taxon:3702
  • Caenorhabditis: WormBase, taxon:6239
  • Candida albicans: CGD, taxon:5476
  • Danio: ZFIN, taxon:7955
  • Dictyostelium: DictyBase, taxon:5782
  • Drosophila: FlyBase, taxon:7227
  • Mus: MGI, taxon:10090
  • Oryza sativa: Gramene, taxon:39947 Oryza sativa
    (japonica cultivar-group)
  • Rattus: RGD, taxon:10116
  • Saccharomyces: SGD, taxon:4932
  • Leishmania major: GeneDB, taxon:5664
  • Plasmodium falciparum: GeneDB, taxon:5833
  • Schizosaccharomyces pombe: GeneDB, taxon:4896
  • Trypanosoma brucei: GeneDB, taxon:185431
  • Bacillus anthracis: TIGR, taxon:198094
  • Coxiella burnetii: TIGR, taxon:227377
  • Geobacter sulfurreducens: TIGR, taxon:243231
  • Listeria monocytogenes: TIGR, taxon:265669
  • Methylococcus capsulatus: TIGR, taxon:243233
  • Pseudomonas syringae: TIGR, taxon:223283
  • Shewanella oneidensis: TIGR, taxon:211586
  • Vibrio cholerae: TIGR, taxon:686
29
NIH-funded experimental research that uses the GO
  • National Institute on Aging (NIA)
  • National Institute of Allergy and Infectious
    Diseases (NIAID)
  • National Cancer Institute (NCI)
  • National Institute on Drug Abuse (NIDA)
  • National Institute on Deafness and Other
    Communication Disorders (NIDCD)
  • National Institute of Dental and Craniofacial
    Research (NIDCR)
  • National Institute of Diabetes and Digestive and
    Kidney Diseases (NIDDK)
  • National Institute of Biomedical Imaging and
    Bioengineering (NIBIB)
  • National Institute of Environmental Health
    Sciences (NIEHS)
  • National Eye Institute (NEI)
  • National Institute of General Medical Sciences
    (NIGMS)
  • National Institute of Child Health and Human
    Development (NICHD)
  • National Human Genome Research Institute (NHGRI)
  • National Heart, Lung and Blood Institute (NHLBI)
  • National Library of Medicine (NLM)
  • National Institute of Neurological Disorders and
    Stroke (NINDS)
  • National Center for Research Resources (NCRR)

30
Other funded experimental projects that use the GO
  • Public Health Service
  • Walter Reed Army Medical Center
  • United States Department of Agriculture
  • Department of Defense
  • USAID
  • National Science Foundation

31
A successful case study
  • There are still challenges to meet

32
Building upon (sharing) light, axiomatic
ontologies eliminates
  • Spelling mistakes or differences
  • oesinophil vs. eosinophil
  • Differences in synonyms, names or naming
    conventions
  • Spermatazoon, sperm cell, spermatozoid, sperm
  • Differences in definitions
  • pericardial cell develops_from mesodermal cell
    vs. Nothing develops_from pericardial cell
  • Inconsistent structure

33
Inconsistent structure
  • GO
  • hemocyte differentiation (sensu Arthropoda)
  • lamellocyte differentiation
  • plasmatocyte differentiation
  • CL
  • hemocyte
  • plasmocyte
  • lamellocyte
34
Finer granularity in the GO
  • GO
  • immune cell
  • activation, migration, chemotaxis
  • erythrocyte differentiation is_a myeloid blood
    cell differentiation
  • CL
  • no such term immune cell
  • no such term myeloid blood cell

35
Coarser granularity in the GO
  • GO
  • neuroblast proliferation is_a cell proliferation
  • CL
  • neuroblast is_a neuronal stem cell is_a stem cell
    is_a cell
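
Because is_a is transitive, the CL chain above lets software infer that a neuroblast is a stem cell and a cell without either fact being stated directly. The following is a minimal Python sketch of that inference. The dict representation and the `ancestors` and `is_a_transitively` helpers are assumptions for illustration, and the single-parent dict relies on the single inheritance principle.

```python
# is_a edges from the CL example on this slide (child -> parent).
IS_A = {
    "neuroblast": "neuronal stem cell",
    "neuronal stem cell": "stem cell",
    "stem cell": "cell",
}


def ancestors(term, is_a=IS_A):
    """Follow is_a links upward; return ancestors from nearest to most general."""
    chain = []
    seen = {term}
    while term in is_a:
        term = is_a[term]
        if term in seen:  # guard against accidental cycles in the data
            raise ValueError(f"cycle detected at {term!r}")
        seen.add(term)
        chain.append(term)
    return chain


def is_a_transitively(term, candidate, is_a=IS_A):
    """True if candidate is reachable from term through one or more is_a links."""
    return candidate in ancestors(term, is_a)


print(ancestors("neuroblast"))
print(is_a_transitively("neuroblast", "stem cell"))
```

An annotation made at the fine-grained term can then be found by a query at any coarser level, which is what reasoning across granularities requires.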

36
Even a light ontology like the GO is difficult
enough
  • A methodology that enforces clear, coherent
    definitions
  • Promotes quality assurance
  • Intent is not hard-coded into software
  • Meaning of relationships is defined, not inferred
  • Guarantees automatic reasoning across ontologies
    and across data at different granularities
  • Consequences of inconsistencies
  • Hard to synchronize manually
  • Inconsistent user-search results

37
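Meeting the goal: drawing inferences
(Diagram; only its labels survive in the transcript)
  • Organisms: human, Xenopus (toad), Drosophila,
    yeast
  • Proteins: SP:1234, SP:8723, SP:19345, SP:48392,
    SP:48291, SP:38921
  • Literature: PMID:5555, PMID:4444, PMID:8976,
    PMID:3924, PMID:9550
  • Nodes: A, B, C, D, with one link marked "?"
  • Evidence labels: direct evidence, indirect
    evidence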
38
Thank you
NCBO Reactome GO SO
  • Chris Mungall