Title: Formal Structuring of Genomic Knowledge
1Formal Structuring of Genomic Knowledge
- Nigam Shah
- Postdoctoral Fellow, SMI
- nigam_at_stanford.edu
2The Understanding cycle
Evaluate for consistency with known information
Formulate hypothesis
Identify conflicts and suggest corrections
Get best possible match with data
Store validated hypotheses
Design experiment to test hypothesis
HyBrow assists in the tasks bound by the red
outline
3Walking along this cycle is hard
- The way much of biology works is by applying prior knowledge (what is known) for interpreting datasets rather than the application of a set of axioms that will elicit knowledge. (Stevens et al., 2000)
- We need to explicitly articulate what is known; that's a problem with the current information overload.
- If we explicitly articulate what is known, in an organizing framework, it serves as a reference for integrating new data with prior knowledge.
- It also increases our ability to fit the results into the big picture.
4How can we make it easier?
- If we design a framework for making statements, or sets of statements comprising a hypothesis, about biological processes, and systematically examine a wide variety of datasets to evaluate them, we can speed up the understanding cycle.
5Events and Implicit claims
- A hypothesis is a statement about relationships (among objects) within a biological system.
- Protein P induces transcription of gene X
- An event is a relationship between two biological entities, which we call agents.
P
promoter gene X
- Implicit claims that can be tested
- P is a transcription factor.
- P is a transcriptional activator.
- P is localized to the nucleus.
- P can bind to the promoter of gene X
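The step from an event to its testable implicit claims can be sketched as a simple lookup. This is only an illustration, not HyBrow's implementation: the `Event` class, the operator name `induces_transcription_of`, and the `IMPLICIT_CLAIMS` table are hypothetical names chosen to mirror the example on this slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    agent_a: str   # the acting agent, e.g. a protein
    operator: str  # the relationship between the two agents
    agent_b: str   # the agent acted upon, e.g. a gene


# Hypothetical table: claims implied by each operator.
# {a} and {b} are filled in with the two agents of the event.
IMPLICIT_CLAIMS = {
    "induces_transcription_of": [
        "{a} is a transcription factor",
        "{a} is a transcriptional activator",
        "{a} is localized to the nucleus",
        "{a} can bind to the promoter of {b}",
    ],
}


def implicit_claims(event: Event) -> List[str]:
    """Return the testable claims implied by an event (empty if none are known)."""
    templates = IMPLICIT_CLAIMS.get(event.operator, [])
    return [t.format(a=event.agent_a, b=event.agent_b) for t in templates]


claims = implicit_claims(Event("P", "induces_transcription_of", "gene X"))
for claim in claims:
    print(claim)
```

Each returned claim can then be checked separately against annotations or data.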
6Components of a formal representation
Formal representation
Domain knowledge model (Ontology)
Conceptual framework
Establish a correspondence between the conceptual
framework and the ontology
Knowledgebase
Domain information and knowledge structured into
the knowledge model
Data generated by researchers. Not always
accessible or available in a Model Organism
Database (except sequence and microarray data)
Curated data and information. A large amount of
information is created and stored by model organism
databases
Database
7The conceptual framework
Event → Subject.Verb.Object
Event → Subject.Verb.Object.Context
Event → Subject.Verb.Object.Context.AssocCond
Subject → (Actor | Context | Event)
Verb → (Physical | Biochemical | Logical)
Object → (Actor | Context | Event)
Actor → (Gene | Protein | Complex | ...)
Context → (Physical | Genetic | Temporal)
AssocCond → (Presence of | absence of).Agent
- The terminal symbols, which cannot be further decomposed in a grammar, are supplied by the hypothesis ontology.
- This grammar, together with the hypothesis ontology, allows us to represent hypotheses in a formal language.
We have specified methods to evaluate formal-language hypotheses for internal consistency and agreement with existing knowledge.
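A minimal sketch of how the grammar above and an ontology might work together when reading one event. It is a simplification under stated assumptions: subjects and objects are restricted to actors (the grammar also allows contexts and nested events), the dotted text form and the `presence_of` / `absence_of` spellings are invented for the example, and the three small dictionaries stand in for the hypothesis ontology.

```python
from typing import Dict

# Toy stand-in for the hypothesis ontology; every entry is illustrative.
ACTORS = {"P": "Protein", "X": "Gene", "A": "Protein"}
VERBS = {"binds_to_promoter": "Physical", "induces": "Logical"}
CONTEXTS = {"nucleus": "Physical", "wild_type": "Genetic"}


def parse_event(text: str) -> Dict[str, str]:
    """Parse 'Subject.Verb.Object[.Context[.presence_of|absence_of.Agent]]'."""
    parts = text.split(".")
    # 3 parts: Subject.Verb.Object; 4: plus Context; 6: plus AssocCond (two parts).
    if len(parts) not in (3, 4, 6):
        raise ValueError(f"not a valid event: {text!r}")
    subject, verb, obj = parts[:3]
    if subject not in ACTORS or obj not in ACTORS:
        raise ValueError("subject and object must be actors known to the ontology")
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb!r}")
    event = {"subject": subject, "verb": verb, "object": obj}
    if len(parts) >= 4:
        if parts[3] not in CONTEXTS:
            raise ValueError(f"unknown context: {parts[3]!r}")
        event["context"] = parts[3]
    if len(parts) == 6:
        if parts[4] not in ("presence_of", "absence_of") or parts[5] not in ACTORS:
            raise ValueError("malformed associated condition")
        event["assoc_cond"] = f"{parts[4]} {parts[5]}"
    return event


print(parse_event("P.binds_to_promoter.X.nucleus.absence_of.A"))
```

Terminal symbols that are missing from the ontology are rejected, which is the sense in which the ontology supplies the terminals of the grammar.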
8The conceptual framework
- Consistency of a hypothesis with prior knowledge is evaluated by applying constraints and rules.
- A constraint is a statement specifying the evidence that contradicts or supports an event.
- A protein must be in the nucleus to bind to a promoter.
- A rule comprises the steps for deciding whether a constraint is satisfied or violated.
Binds_to_promoter P, g
Annotation constraints:
if cellular location of P is not nucleus, give a penalty.
if biological process is not transcription, give a penalty.
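The two annotation constraints for Binds_to_promoter could be written as a rule along these lines. This is a sketch, not the HyBrow rule itself: the `ANNOTATIONS` table, the unit `PENALTY`, and the choice to penalize a missing annotation are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical annotations keyed by protein name (illustrative, not real data).
ANNOTATIONS: Dict[str, Dict[str, str]] = {
    "P": {"cellular_location": "nucleus", "biological_process": "transcription"},
    "Q": {"cellular_location": "cytoplasm", "biological_process": "transcription"},
}

PENALTY = 1  # assumed unit penalty; the slides do not give a value


def binds_to_promoter_penalty(
    protein: str,
    gene: str,
    annotations: Dict[str, Dict[str, str]] = ANNOTATIONS,
) -> Tuple[int, List[str]]:
    """Apply the annotation constraints to the event 'protein binds to promoter of gene'.

    Only the protein's annotations are inspected here; `gene` is kept so the
    signature matches the event. A missing annotation counts as a violation
    (an assumption of this sketch).
    """
    ann = annotations.get(protein, {})
    penalty = 0
    reasons: List[str] = []
    if ann.get("cellular_location") != "nucleus":
        penalty += PENALTY
        reasons.append(f"{protein} is not annotated as located in the nucleus")
    if ann.get("biological_process") != "transcription":
        penalty += PENALTY
        reasons.append(f"{protein} is not annotated to the process transcription")
    return penalty, reasons


print(binds_to_promoter_penalty("P", "g"))
print(binds_to_promoter_penalty("Q", "g"))
```

Returning the reasons alongside the penalty keeps each conflict explainable, rather than reducing the evaluation to a single number.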
10Hypothesis Ontology
- Expressive enough to describe the galactose system at a coarse level of detail.
- It is compatible with other ontology efforts.
- E.g. GO, so that GO annotations can be used directly in HyBrow.
- We have also developed a grammar to write hypotheses using events from this ontology.
11Grammar for a hypothesis
A hypothesis consists of at least one event stream.
An event stream is a sequence of one or more events or event streams with logical joints (or operators) between them.
An event has exactly one agent_a, exactly one agent_b and exactly one operator (i.e. a relationship between the two agents). It also has a physical location that denotes where the event happened, the genetic context of the organism and associated experimental perturbations when the event happened.
A logical joint is the conjunction between two event streams.
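The structure described above maps naturally onto a small recursive data type. The sketch below is one possible encoding: the class and field names (`EventStream`, `joints`, `perturbations`, and so on) and the operator strings are assumptions, and the joints are stored as plain strings such as "and" / "or".

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Event:
    agent_a: str               # exactly one agent_a
    operator: str              # exactly one relationship between the agents
    agent_b: str               # exactly one agent_b
    location: str = ""         # physical location where the event happened
    genetic_context: str = ""  # genetic context of the organism
    perturbations: List[str] = field(default_factory=list)  # experimental perturbations


@dataclass
class EventStream:
    items: List[Union[Event, "EventStream"]]  # events or nested event streams
    joints: List[str] = field(default_factory=list)  # logical joints between items

    def __post_init__(self) -> None:
        if not self.items:
            raise ValueError("an event stream needs at least one event")
        if len(self.joints) != len(self.items) - 1:
            raise ValueError("need exactly one logical joint between consecutive items")

    def events(self) -> List[Event]:
        """Flatten nested streams into the list of events they contain."""
        found: List[Event] = []
        for item in self.items:
            found.extend(item.events() if isinstance(item, EventStream) else [item])
        return found


@dataclass
class Hypothesis:
    streams: List[EventStream]

    def __post_init__(self) -> None:
        if not self.streams:
            raise ValueError("a hypothesis needs at least one event stream")


e1 = Event("P", "binds_to_promoter", "X", location="nucleus")
e2 = Event("P", "induces_transcription_of", "X")
e3 = Event("Q", "induces_transcription_of", "X")
inner = EventStream([e2, e3], ["or"])
hypothesis = Hypothesis([EventStream([e1, inner], ["and"])])
print(len(hypothesis.streams[0].events()))
```

Checking the cardinalities at construction time means a malformed hypothesis is rejected before any evaluation against data begins.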
13Constraints
- A constraint is a statement specifying the evidence that supports or contradicts an event.
- Types of constraints
- Ontology
- Data
- Existence
- Temporal
- X binds to promoter of Y
- Ontology
- X must be a protein or complex; Y must be a gene
- Data
- X must be annotated to be localized to the nucleus.
- The promoter of Y must have a binding site for X
- Existence
- The gene for X must be present
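To make the grouping by constraint type concrete, the constraints listed for "X binds to promoter of Y" can be held as data and checked against a knowledgebase, tallying supports and conflicts. Everything here is a hypothetical sketch: the `KB` layout, the `Constraint` tuple, and the checks are invented for the example, and no temporal constraint is included because the slide gives none for this event.

```python
from typing import Callable, List, NamedTuple, Tuple

# Toy knowledgebase; every entry is illustrative, not real data.
KB = {
    "types": {"X": "protein", "Y": "gene"},
    "localization": {"X": "nucleus"},
    "binding_sites": {"Y": ["X"]},  # proteins with a binding site in Y's promoter
    "gene_present": {"X": True},    # whether the gene encoding the protein is present
}


class Constraint(NamedTuple):
    kind: str          # ontology, data, or existence
    description: str
    check: Callable[[dict, str, str], bool]


BINDS_TO_PROMOTER: List[Constraint] = [
    Constraint("ontology", "X must be a protein or complex",
               lambda kb, x, y: kb["types"].get(x) in ("protein", "complex")),
    Constraint("ontology", "Y must be a gene",
               lambda kb, x, y: kb["types"].get(y) == "gene"),
    Constraint("data", "X must be annotated as localized to the nucleus",
               lambda kb, x, y: kb["localization"].get(x) == "nucleus"),
    Constraint("data", "the promoter of Y must have a binding site for X",
               lambda kb, x, y: x in kb["binding_sites"].get(y, [])),
    Constraint("existence", "the gene for X must be present",
               lambda kb, x, y: kb["gene_present"].get(x, False)),
]


def evaluate(kb: dict, x: str, y: str) -> List[Tuple[str, str, bool]]:
    """Return (kind, description, satisfied) for every constraint on the event."""
    return [(c.kind, c.description, c.check(kb, x, y)) for c in BINDS_TO_PROMOTER]


results = evaluate(KB, "X", "Y")
supports = sum(1 for _, _, ok in results if ok)
conflicts = len(results) - supports
print(supports, conflicts)
```

Keeping the constraints as data rather than hard-coded branches lets new events get their own constraint lists without changing the evaluation loop.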
14Rules
A rule decides whether a constraint is satisfied or violated.
The first layer of rules enforces the constraints to decide support or conflict based on the data we have.
A second layer of rules checks the logical structure of the hypothesis.
16The knowledgebase
Proteomics
Microarray
HyBrow KB
Sequence
Literature
17User interfaces
Hypothesis described in Natural Language
Biological process described in a formal language
18Evaluating a hypothesis
19Evaluating a hypothesis
20Screen shot of the output
A list of events in the submitted hypothesis
A plot of the counts of support and conflicts
An explanation for each support / conflict with a
link to the data source
21HyBrow take home
- The minimum requirement for a formal representation
- Ability to represent data → information → knowledge
- A language to express your thought experiment (your model, hypothesis, theory, theorem etc.)
- A reasoning framework to evaluate the outcome/validity/accuracy of your thought experiment
- We should not aim to use all the data and come up with ONE model that explains everything.
- It is much better to propose a model and examine if your data supports/contradicts it
22A clinical example
- Autism is a developmental disability characterized by severe and pervasive impairment in several areas of development.
- Nutrigenomics is gathering a lot of attention in autism treatment
- DAN! (Defeat Autism Now!) researchers sometimes refer to this as biomedical treatment
- Tests for deciding the optimal nutrigenomics therapy are costly and hard to interpret
23Excerpt from a parent's email
- right now, that is a manual process to relate the genetic (mutation info...) and any microbial inputs to a biochemical pathway diagram and relate the mutations to specific supplement or enzyme therapies. It costs >1000 and 6-8 months for someone to manually interpret the results.
- I was wondering if it would be helpful to develop a model to contain the static/known information and some dynamic models to help answer some interesting questions relevant to the person's data.
- This might make it possible to develop tools for a physician or motivated individual to use nutrigenomic information.
24Credits and acknowledgements
- Stephen Racunas
- Co-developer of HyBrow
25Organon
- an Organon, an instrument for the proper conduct and representation of scientific research.
- The first Organon was written by the Ancient Greek philosopher Aristotle in the 4th Century B.C., and included his works on logic and the theory of science. [1]
- The second great Organon, the Novum Organum (1620) of Francis Bacon, was written as an update, extension and correction of the Aristotelian Organon in light of the success and experimental methods of post-Galilean modern natural science almost 2000 years later. [2]
- [1] The works known as Aristotle's Organon can be found in The Complete Works of Aristotle, Two Volumes (Jonathan Barnes ed.). Princeton: Princeton University Press, 1984.
- [2] Bacon, F. Novum Organum (Urbach, P. and Gibson, J. transl. and eds.). Chicago: Open Court, 1994.