Title: Computational Discovery of Communicable Knowledge
1Computational Discovery of Communicable
Scientific Models
Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/langley
langley@csli.stanford.edu
Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, S.
Dzeroski, J. Sanchez, Oren Shiran, and L.
Todorovski for their contributions to this
research, which is funded by a grant from the
National Science Foundation.
2Data Mining vs. Scientific Discovery
There exist two computational paradigms for discovering explicit knowledge from data:
- Data mining generates knowledge cast as decision trees, logical rules, or other notations invented by AI researchers.
- Computational scientific discovery instead uses equations, structural models, reaction pathways, or other formalisms invented by scientists and engineers.
Both approaches draw on heuristic search to find
regularities in data, but they differ
considerably in their emphases.
3Lesson 1
Traditional notations from machine learning are
not communicated easily to domain scientists.
Ecosystem model:
NPPc = sum over months of max(E * IPAR, 0)
E = 0.56 * T1 * T2 * W
T1 = 0.8 + 0.02 * Topt - 0.0005 * Topt^2
T2 = 1.18 / [(1 + e^(0.2 * (Topt - Tempc - 10))) * (1 + e^(0.3 * (Tempc - Topt - 10)))]
W = 0.5 + 0.5 * EET / PET
PET = 1.6 * (10 * Tempc / AHI)^A * PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 * AHI^3 - 0.000077 * AHI^2 + 0.018 * AHI + 0.49
IPAR = 0.5 * FPAR-FAS * Monthly-Solar * Sol-Conver
FPAR-FAS = min((SR-FAS - 1.08) / SR(UMD-VEG), 0.95)
SR-FAS = -(Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI - 1000)
[Figure: gene regulation model]
4Lesson 2
Scientists often have initial models that should
influence the discovery process.
Discovery
Initial model
Observations
Revised model
5Lesson 3
Scientific data are often rare and difficult to
obtain rather than being plentiful.
Ecosystem model
Gene regulation model
Number of variables: 9
Number of initial links: 11
Number of possible links: ?70
Number of samples: 20
6Lesson 4
Scientists want models that move beyond
description to provide explanations of their data.
Ecosystem model
Gene regulation model
7Lesson 5
Scientists want computational assistance rather
than automated discovery systems.
Discovery
Initial model
Observations
Revised model
8The Nature of Systems Science
Disciplines like Earth science and computational biology differ from traditional fields in that they:
- focus on synthesis rather than analysis in their operation
- rely on computer modeling as one of their central methods
- develop system-level models with many variables and relations
- require that models make contact with known mechanisms.
However, existing methods for computational
scientific discovery were not designed with
systems science in mind.
9Time Series from the Ross Sea Ecosystem
10Inductive Process Modeling
Our approach is to design and implement computational methods for inductive process modeling, which:
- represent scientific models as sets of quantitative processes
- use these models to predict and explain observational data
- search a space of process models to find good candidates
- utilize background knowledge to constrain this search.
This framework has great potential both for
modeling scientific reasoning and aiding
practicing scientists.
11Existing Formalisms Are Inadequate
12A Process Model for an Aquatic Ecosystem
model AquaticEcosystem
  variables: phyto, zoo, nitro, residue
  observables: phyto, nitro

  process phyto_loss
    equations: d[phyto,t,1] = -0.307 * phyto
               d[residue,t,1] = 0.307 * phyto

  process zoo_loss
    equations: d[zoo,t,1] = -0.251 * zoo
               d[residue,t,1] = 0.251 * zoo

  process zoo_phyto_grazing
    equations: d[zoo,t,1] = 0.615 * 0.495 * zoo
               d[residue,t,1] = 0.385 * 0.495 * zoo
               d[phyto,t,1] = -0.495 * zoo

  process nitro_uptake
    conditions: nitro > 0
    equations: d[phyto,t,1] = 0.411 * phyto
               d[nitro,t,1] = -0.098 * 0.411 * phyto

  process nitro_remineralization
    equations: d[nitro,t,1] = 0.005 * residue
               d[residue,t,1] = -0.005 * residue
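One way to read the AquaticEcosystem model above: each process contributes a term to the rate of change of the variables it mentions, and a variable's total derivative is the sum of those terms. The Python sketch below illustrates that reading. It is not the authors' simulator. The coefficients are the ones listed in the model; the minus signs on loss and uptake terms, the `* zoo` factor in the residue equation of zoo_loss, the forward Euler integrator, the step size, and the initial values are all assumptions made for illustration.

```python
# Each process maps the current state to its contributions to d[var,t,1].
# Coefficients come from the AquaticEcosystem model; signs on loss terms,
# the Euler integrator, dt, and the initial values are illustrative assumptions.

def phyto_loss(s):
    return {"phyto": -0.307 * s["phyto"], "residue": 0.307 * s["phyto"]}

def zoo_loss(s):
    return {"zoo": -0.251 * s["zoo"], "residue": 0.251 * s["zoo"]}

def zoo_phyto_grazing(s):
    return {"zoo": 0.615 * 0.495 * s["zoo"],
            "residue": 0.385 * 0.495 * s["zoo"],
            "phyto": -0.495 * s["zoo"]}

def nitro_uptake(s):
    if not s["nitro"] > 0:          # condition: nitro > 0
        return {}                   # an inactive process contributes nothing
    return {"phyto": 0.411 * s["phyto"],
            "nitro": -0.098 * 0.411 * s["phyto"]}

def nitro_remineralization(s):
    return {"nitro": 0.005 * s["residue"], "residue": -0.005 * s["residue"]}

PROCESSES = [phyto_loss, zoo_loss, zoo_phyto_grazing,
             nitro_uptake, nitro_remineralization]

def derivatives(state):
    """Sum the contributions of all processes for each variable."""
    total = {var: 0.0 for var in state}
    for process in PROCESSES:
        for var, rate in process(state).items():
            total[var] += rate
    return total

def simulate(state, steps, dt=0.1):
    """Forward Euler integration; returns the list of visited states."""
    trajectory = [dict(state)]
    for _ in range(steps):
        rates = derivatives(state)
        state = {var: state[var] + dt * rates[var] for var in state}
        trajectory.append(dict(state))
    return trajectory

# Hypothetical starting point; the slides give no initial values.
initial = {"phyto": 10.0, "zoo": 1.0, "nitro": 20.0, "residue": 0.0}
trajectory = simulate(initial, steps=50)
```

Only phyto and nitro are declared observable, so a fit to data would compare just those two columns of `trajectory` against measurements, while zoo and residue remain unobserved.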
13Advantages of Quantitative Process Models
Process models offer scientists a promising framework because:
- they embed quantitative relations within qualitative structure
- they refer to notations and mechanisms familiar to experts
- they provide dynamical predictions of changes over time
- they offer causal and explanatory accounts of phenomena
- they retain the modularity needed for induction/abduction.
Quantitative process models provide an important
alternative to formalisms used currently in
computational discovery.
14Challenges of Inductive Process Modeling
Process model induction differs from typical learning tasks in that:
- process models characterize behavior of dynamical systems
- variables are continuous but can have discontinuous behavior
- observations are not independently and identically distributed
- models may contain unobservable processes and variables
- multiple processes can interact to produce complex behavior.
Compensating factors include a focus on
deterministic systems and the availability of
background knowledge.
15Encoding Background Knowledge
To constrain candidate models, we can utilize available background knowledge about the domain.
Previous work has encoded background knowledge in terms of:
- Horn clause programs (e.g., Towell & Shavlik, 1990)
- context-free grammars (e.g., Dzeroski & Todorovski, 1997)
- prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations are familiar to
domain scientists, which suggests the need for
another approach.
16Generic Processes as Background Knowledge
We cast background knowledge as generic processes that specify:
- the variables involved in a process and their types
- the parameters appearing in a process and their ranges
- the forms of conditions on the process
- the forms of associated equations and their parameters.
Generic processes are building blocks from which
one can compose a specific process model.
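The four ingredients listed above map naturally onto a small record type. The following Python sketch shows one possible encoding, not the representation used in the authors' system. The class name `GenericProcess`, the helper `instantiate`, and the parameter name `alpha` are hypothetical; the example process mirrors the exponential loss of a species into detritus.

```python
from dataclasses import dataclass, field

@dataclass
class GenericProcess:
    name: str
    variables: dict    # role name -> variable type, e.g. {"S": "species"}
    parameters: dict   # parameter name -> (low, high) allowed range
    equations: dict    # role name -> function(params, values) -> rate of change
    conditions: list = field(default_factory=list)  # predicates(params, values)

# Hypothetical encoding of an exponential loss from a species into detritus.
exponential_loss = GenericProcess(
    name="exponential_loss",
    variables={"S": "species", "D": "detritus"},
    parameters={"alpha": (0.0, 1.0)},
    equations={
        "S": lambda p, v: -p["alpha"] * v["S"],
        "D": lambda p, v: p["alpha"] * v["S"],
    },
)

def instantiate(generic, binding, params):
    """Bind roles to concrete variables and fix parameters within their ranges."""
    for name, value in params.items():
        low, high = generic.parameters[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} outside [{low}, {high}]")

    def rates(state):
        values = {role: state[var] for role, var in binding.items()}
        if not all(cond(params, values) for cond in generic.conditions):
            return {}  # process inactive when a condition fails
        return {binding[role]: eq(params, values)
                for role, eq in generic.equations.items()}

    return rates

# A specific process: phytoplankton lost to residue at rate 0.307.
phyto_loss = instantiate(exponential_loss,
                         {"S": "phyto", "D": "residue"},
                         {"alpha": 0.307})
contribution = phyto_loss({"phyto": 10.0, "residue": 0.0})
```

Keeping ranges and types on the generic process is what lets a search procedure reject ill-typed bindings and out-of-range parameters before any simulation is run.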
17Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 * α * S
             d[D,t,1] = α * S

generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ * ρ * S1
             d[D,t,1] = (1 - γ) * ρ * S1
             d[S2,t,1] = -1 * ρ * S1

generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ * S
             d[N,t,1] = -1 * β * μ * S

generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π * D
             d[D,t,1] = -1 * π * D

generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
18Inducing Process Models
training data
process model
Induction
generic processes
19A Method for Process Model Construction
The IPM algorithm constructs explanatory models from generic components in four stages:
1. Find all ways to instantiate known generic processes with specific variables, subject to type constraints.
2. Combine instantiated processes into candidate generic models, subject to additional constraints (e.g., number of processes).
3. For each generic model, carry out search through parameter space to find good coefficients.
4. Return the parameterized model with the best overall score.
Our typical evaluation metric is squared error,
but we have also explored other measures of
explanatory adequacy.
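The four stages can be sketched as an exhaustive generate-and-test loop. The Python below is a minimal illustration under stated assumptions, not the IPM implementation: the type signatures in `GENERIC`, the variable typing in `VARIABLES`, and the caller-supplied `fit` function (which stands in for the parameter search of stage 3) are all hypothetical.

```python
from itertools import combinations, permutations

# Hypothetical typed variables and generic process signatures (role types in order).
VARIABLES = {"phyto": "species", "zoo": "species",
             "nitro": "nutrient", "residue": "detritus"}
GENERIC = {
    "exponential_loss": ["species", "detritus"],
    "grazing": ["species", "species", "detritus"],
    "nutrient_uptake": ["species", "nutrient"],
    "remineralization": ["nutrient", "detritus"],
    "constant_inflow": ["nutrient"],
}

def instantiations(generic, variables):
    """Stage 1: bind each generic process to distinct variables of matching type."""
    found = []
    for name, role_types in generic.items():
        for combo in permutations(variables, len(role_types)):
            if all(variables[v] == t for v, t in zip(combo, role_types)):
                found.append((name, combo))
    return found

def candidate_structures(instances, max_processes):
    """Stage 2: every non-empty set of instances up to a size limit."""
    for size in range(1, max_processes + 1):
        yield from combinations(instances, size)

def squared_error(predicted, observed):
    """Sum of squared differences between two equal-length sequences."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def induce(generic, variables, max_processes, fit):
    """Stages 3 and 4: fit each structure and keep the lowest-scoring one.

    `fit(structure)` is supplied by the caller and must return
    (parameters, score), where a lower score is better.
    """
    best = None
    instances = instantiations(generic, variables)
    for structure in candidate_structures(instances, max_processes):
        params, score = fit(structure)
        if best is None or score < best[2]:
            best = (structure, params, score)
    return best
```

With these four variables the sketch yields 8 instantiated processes and 36 structures of at most two processes. The count grows combinatorially with the size limit, which is why constraints on structure matter.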
20Estimating Parameters in Process Models
To estimate the parameters for each generic model structure, the IPM algorithm:
1. Selects random initial values that fall within ranges specified in the generic processes.
2. Improves these parameters using the Levenberg-Marquardt method until it reaches a local optimum.
3. Generates new candidate values through random jumps along dimensions of the parameter vector and continues the search.
4. Restarts the search from a new random initial point if no improvement occurs after N jumps.
This multi-level method gives reasonable fits to
time-series data from a number of domains, but it
is computationally intensive.
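The control flow of this multi-level search can be sketched in a few lines of Python. To keep the example dependency-free, a simple coordinate-wise pattern search stands in for Levenberg-Marquardt, so this is an illustration of the restart-and-jump scheme only, not of the authors' optimizer. The function names, the improvement threshold, the default counts, and the synthetic decay data are all assumptions.

```python
import math
import random

def local_search(loss, theta, bounds, step=0.1, tol=1e-6):
    """Coordinate-wise pattern search: a simple stand-in for Levenberg-Marquardt."""
    theta = list(theta)
    best = loss(theta)
    while step > tol:
        improved = False
        for i, (low, high) in enumerate(bounds):
            for delta in (step, -step):
                trial = list(theta)
                trial[i] = min(high, max(low, trial[i] + delta))
                score = loss(trial)
                if score < best:
                    theta, best, improved = trial, score, True
        if not improved:
            step /= 2  # shrink the step once no neighbour is better
    return theta, best

def estimate(loss, bounds, restarts=5, max_jumps=10, rng=None):
    rng = rng or random.Random(0)
    overall = None
    for _ in range(restarts):
        # 1. random initial values within the specified ranges
        theta = [rng.uniform(low, high) for low, high in bounds]
        # 2. improve them until a local optimum is reached
        theta, best = local_search(loss, theta, bounds)
        # 3. random jumps along single dimensions, then continue the search
        failures = 0
        while failures < max_jumps:
            trial = list(theta)
            i = rng.randrange(len(bounds))
            trial[i] = rng.uniform(*bounds[i])
            trial, score = local_search(loss, trial, bounds)
            if score < best - 1e-12:
                theta, best, failures = trial, score, 0
            else:
                failures += 1
        # 4. max_jumps jumps without improvement: restart from a new random point
        if overall is None or best < overall[1]:
            overall = (theta, best)
    return overall

# Synthetic data from an exponential decay with rate 0.307 (illustration only).
TIMES = range(10)
OBSERVED = [10.0 * math.exp(-0.307 * t) for t in TIMES]

def loss(theta):
    (rate,) = theta
    return sum((10.0 * math.exp(-rate * t) - y) ** 2
               for t, y in zip(TIMES, OBSERVED))

theta, score = estimate(loss, bounds=[(0.0, 1.0)])
```

Every jump triggers a full local search, and every restart triggers a fresh round of jumps, which makes the cost of the nested loops easy to see.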
21Observations from the Ross Sea
22Results on Training Data from Ross Sea
23Results on Test Data from Ross Sea
24Results on a Protist Ecosystem
25Results on Ringkobing Fjord
26Results on Biochemical Kinetics
observed trajectories
predicted trajectories
27Interfacing with Scientists
Because few scientists want to be replaced, we are developing an interactive environment, PROMETHEUS, that lets users:
- specify a quantitative process model of the target system
- display and edit the model's structure and details graphically
- simulate the model's behavior over time and situations
- compare the model's predicted behavior to observations
- invoke a revision module in response to detected anomalies.
The environment offers computational assistance
in forming and evaluating models but lets the
user retain control.
28Viewing a Process Model Graphically
29Indicating Processes to Consider Adding
30Specifying Data and Search Parameters
31Inspecting Revised Process Models
32Intellectual Influences
Our approach to computational discovery incorporates ideas from many traditions:
- computational scientific discovery (e.g., Langley et al., 1983)
- theory revision in machine learning (e.g., Towell, 1991)
- qualitative physics and simulation (e.g., Forbus, 1984)
- languages for scientific simulation (e.g., STELLA, MATLAB)
- interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from
machine learning, AI, programming languages, and
human-computer interaction.
33Contributions of the Research
In summary, our work on computational scientific discovery has, in responding to various challenges, produced:
- a new formalism for representing scientific process models
- a computational method for simulating these models' behavior
- an encoding for background knowledge as generic processes
- an algorithm for inducing process models from time-series data
- an interactive environment for model construction/utilization.
We have demonstrated this approach to model
creation on domains from Earth science,
microbiology, and engineering.
34Some Recent Extensions
In recent work, we have extended our approach to incorporate:
- heuristic beam search through the space of process models
- hierarchical generic processes that further constrain search
- an ensemble-like method that mitigates overfitting effects
- metrics for explanatory adequacy based on trajectory shapes.
Inductive process modeling has great potential to
speed progress in systems science and engineering.
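The slides do not describe how the beam search extension works, so the following Python is only a generic sketch of beam search over model structures, assuming that a structure is a set of instantiated processes and that the caller supplies a `score` function where lower is better. The function name, the beam width, and the toy scoring function are hypothetical.

```python
def beam_search(instances, score, beam_width=3, max_processes=4):
    """Grow structures one process at a time, keeping only the best few."""
    empty = frozenset()
    beam = [empty]
    best = (score(empty), empty)
    for _ in range(max_processes):
        # Expand every structure in the beam by one unused process instance.
        expansions = {model | {p} for model in beam
                      for p in instances if p not in model}
        if not expansions:
            break
        beam = sorted(expansions, key=score)[:beam_width]
        top = score(beam[0])
        if top < best[0]:
            best = (top, beam[0])
    return best

# Toy score: distance from a hidden target structure (illustration only).
TARGET = frozenset({"a", "b"})

def toy_score(model):
    return len(model ^ TARGET)

result = beam_search(["a", "b", "c", "d"], toy_score, beam_width=2)
```

Compared with exhaustive enumeration, the beam evaluates at most `beam_width * len(instances)` expansions per level, at the price of possibly discarding a structure that would have led to the best model.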