Title: Computational Discovery of Communicable Knowledge
1Computational Discovery of Explanatory Process
Models
Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/langley
langley_at_csli.stanford.edu
Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, A.
Pohorille, J. Sanchez, K. Saito, and J. Shrager
for their contributions to this research.
2Data Mining vs. Scientific Discovery
There exist two computational paradigms for
discovering explicit knowledge from data. The
data mining movement develops computational
methods that
- induce predictive models from large, often business, data sets
- cast models as decision trees, logical rules, or other notations invented by AI researchers.
In contrast, computational scientific discovery
focuses on
- constructing models from (often small) scientific data sets
- stated in formalisms invented by scientists and engineers.
Both approaches draw on heuristic search to find
regularities in data, but they differ
considerably in their emphases.
3In Memoriam
Three years ago, computational scientific discovery lost two of its founding fathers:
- Herbert A. Simon (1916-2001)
- Jan M. Zytkow (1945-2001)
Both contributed to the field in many ways: posing new problems, inventing methods, training students, and organizing meetings. Moreover, both were interdisciplinary researchers who contributed to computer science, psychology, philosophy, and statistics. Herb Simon and Jan Zytkow were excellent role models whom we should all aim to emulate.
4Time Line for Research on Computational
Scientific Discovery
[Figure: timeline from 1979 to 2000 placing discovery systems by year. Systems shown: AM, Dendral, Bacon.1-Bacon.5, Abacus, Coper, Fahrenheit, E, Tetrad, IDSN, Hume, ARC, DST, GPN, LaGrange, SDS, SSF, RF5, LaGramge, RL, Progol, Glauber, NGlauber, IDSQ, Live, HR, Dalton, Stahl, Gell-Mann, BR-3, Mendel, Pauli, Stahlp, Revolver, BR-4, IE, Coast, Phineas, AbE, Kekada, Mechem, CDP, Astra, GPM.]
5Successes of Computational Scientific Discovery
Over the past decade, systems of this type have helped discover new knowledge in many scientific fields:
- qualitative chemical factors in mutagenesis (King et al., 1996)
- quantitative laws of metallic behavior (Sleeman et al., 1997)
- qualitative conjectures in number theory (Colton et al., 2000)
- temporal laws of ecological behavior (Todorovski et al., 2000)
- reaction pathways in catalytic chemistry (Valdes-Perez, 1994)
Each has led to publications in the refereed
scientific literature (e.g., Langley, 2000), but
they did not focus on systems science.
6The Nature of Systems Science
Disciplines like Earth science and computational
biology differ from traditional fields in that
they
- focus on synthesis rather than analysis in their operation
- rely on computer modeling as one of their central methods
- develop system-level models with many variables and relations
- evaluate their models on observational, not experimental, data.
Developing and testing such models are complex
tasks that would benefit from computational aids.
However, existing methods for computational
scientific discovery were not designed with
systems science in mind.
7Observations from the Ross Sea
8Inductive Process Modeling
Our response is to design, construct, and
evaluate computational methods for inductive
process modeling, which
- represent scientific models as sets of quantitative processes
- use these models to predict and explain observational data
- search a space of process models to find good candidates
- utilize background knowledge to constrain this search.
This framework has great potential for aiding
systems science, but it raises new computational
challenges.
9Challenges of Inductive Process Modeling
Process model induction differs from typical
learning tasks in that
- process models characterize behavior of dynamical systems
- variables are continuous but can have discontinuous behavior
- observations are not independently and identically distributed
- models may contain unobservable processes and variables
- multiple processes can interact to produce complex behavior.
Compensating factors include a focus on
deterministic systems and the availability of
background knowledge.
10Issue 1: Representing Scientific Models
To assist system scientists' modeling efforts, we must first encode candidate models that
- address observational rather than experimental data
- deal with dynamic systems that change over time
- have an explanatory rather than a descriptive character
- are causal in that they describe chains of effects
- contain quantitative relations and qualitative structure.
We need some formal way to represent such models
that can be interpreted computationally.
11Why Are Existing Formalisms Inadequate?
12A Process Model for an Aquatic Ecosystem
model Ross_Sea_Ecosystem
  variables: phyto, nitro, residue, light, growth_rate, effective_light, ice_factor
  observables: phyto, nitro, light, ice_factor
  process phyto_loss
    equations: d[phyto,t,1] = -0.1 * phyto
               d[residue,t,1] = 0.1 * phyto
  process phyto_growth
    equations: d[phyto,t,1] = growth_rate * phyto
  process phyto_uptakes_nitro
    conditions: nitro > 0
    equations: d[nitro,t,1] = -1 * 0.204 * growth_rate * phyto
  process growth_limitation
    equations: growth_rate = 0.23 * min(nitrate_rate, light_rate)
  process nitrate_availability
    equations: nitrate_rate = nitrate / (nitrate + 5)
  process light_availability
    equations: light_rate = effective_light / (effective_light + 50)
  process light_attenuation
    equations: effective_light = light * ice_factor
13Advantages of Quantitative Process Models
Process models are a good target for discovery
systems because
- they embed quantitative relations within qualitative structure
- they refer to notations and mechanisms familiar to scientists
- they provide dynamical predictions of changes over time
- they offer causal and explanatory accounts of phenomena
- they retain the modularity needed to support induction.
Quantitative process models provide an important
alternative to formalisms used currently in
computational discovery.
14Issue 2: Generating Predictions and Explanations
To utilize or evaluate a given process model, we must simulate its behavior over time:
- specify initial values for input variables and time step size
- on each time step, determine which processes are active
- solve active algebraic/differential equations with known values
- propagate values and recursively solve other active equations
- when multiple processes influence the same variable, assume their effects are additive.
This performance method makes specific
predictions that we can compare to observations.
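The simulation steps above can be sketched in Python for the Ross Sea model from the earlier slide. This is a minimal sketch under stated assumptions, not the authors' simulator: it assumes forward Euler integration with a fixed step, it reads the quantity that the slide's equations call `nitrate` as the variable `nitro`, and the initial values and constant `light` and `ice_factor` inputs are invented for illustration.

```python
def simulate(state, light, ice_factor, steps, dt=1.0):
    """Forward-Euler simulation of the Ross Sea process model (illustrative)."""
    trajectory = [dict(state)]
    for _ in range(steps):
        # Algebraic equations are solved first, in dependency order.
        effective_light = light * ice_factor                    # light_attenuation
        light_rate = effective_light / (effective_light + 50)   # light_availability
        nitrate_rate = state["nitro"] / (state["nitro"] + 5)    # nitrate_availability
        growth_rate = 0.23 * min(nitrate_rate, light_rate)      # growth_limitation

        # Differential equations: effects on the same variable are summed.
        deriv = {"phyto": 0.0, "nitro": 0.0, "residue": 0.0}
        deriv["phyto"] += -0.1 * state["phyto"]                  # phyto_loss
        deriv["residue"] += 0.1 * state["phyto"]
        deriv["phyto"] += growth_rate * state["phyto"]           # phyto_growth
        if state["nitro"] > 0:                                   # phyto_uptakes_nitro is conditional
            deriv["nitro"] += -1 * 0.204 * growth_rate * state["phyto"]

        # Advance every state variable by one Euler step.
        state = {name: state[name] + dt * deriv[name] for name in state}
        trajectory.append(dict(state))
    return trajectory


# Hypothetical initial values and inputs, chosen only for illustration.
initial = {"phyto": 1.0, "nitro": 20.0, "residue": 0.0}
run = simulate(initial, light=100.0, ice_factor=0.5, steps=100)
```

The list `run` holds one state per time step, so any observable in it (for example `phyto`) can be compared directly against a measured series.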
15Issue 3: Encoding Background Knowledge
To constrain candidate models, we can utilize available background knowledge about the domain. Previous work has encoded background knowledge in terms of
- Horn clause programs (e.g., Towell & Shavlik, 1990)
- context-free grammars (e.g., Dzeroski & Todorovski, 1997)
- prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations are familiar to
domain scientists, which suggests the need for
another approach.
16Generic Processes as Background Knowledge
Our framework casts background knowledge as
generic processes that specify
- the variables involved in a process and their types
- the parameters appearing in a process and their ranges
- the forms of conditions on the process, and
- the forms of associated equations and their parameters.
Generic processes are building blocks from which
one can compose a specific process model.
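One way to picture a generic process as a data structure, and how type constraints limit its instantiations, is the Python sketch below. It is illustrative only: the class `GenericProcess`, the function `instantiate`, the parameter name `alpha`, and the variable-to-type assignments are all assumptions made for this example, not part of the authors' system.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class GenericProcess:
    name: str
    roles: tuple        # (role name, required type) pairs
    parameters: tuple   # (parameter name, lower bound, upper bound) triples


def instantiate(generic, variable_types):
    """Return every binding of distinct variables to roles that respects the types."""
    bindings = []
    for combo in permutations(variable_types, len(generic.roles)):
        if all(variable_types[var] == role_type
               for var, (_, role_type) in zip(combo, generic.roles)):
            bindings.append({role: var for var, (role, _) in zip(combo, generic.roles)})
    return bindings


# Hypothetical encoding of an exponential-loss process with one parameter in [0, 1].
exponential_loss = GenericProcess(
    name="exponential_loss",
    roles=(("S", "species"), ("D", "detritus")),
    parameters=(("alpha", 0.0, 1.0),),
)

# Assumed types for four ecosystem variables.
types = {"phyto": "species", "zoo": "species",
         "residue": "detritus", "nitro": "nutrient"}
bindings = instantiate(exponential_loss, types)
```

With these assumed types, only the two species can fill role `S` and only `residue` can fill role `D`, so twelve possible ordered pairs shrink to two admissible bindings.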
17Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 * α * S
             d[D,t,1] = α * S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π * D
             d[D,t,1] = -1 * π * D
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ * ρ * S1
             d[D,t,1] = (1 - γ) * ρ * S1
             d[S2,t,1] = -1 * ρ * S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ * S
             d[N,t,1] = -1 * β * μ * S
18Issue 4: Inducing Process Models
[Figure: an Induction module takes training data and generic processes as inputs and produces a process model.]
19A Method for Process Model Induction
We have implemented the IPM algorithm, which induces process models from generic components in four stages:
1. Find all ways to instantiate known generic processes with specific variables, subject to type constraints.
2. Combine instantiated processes into candidate generic models subject to additional constraints (e.g., number of processes).
3. For each generic model, carry out search through parameter space to find good coefficients.
4. Return the parameterized model with the best overall score.
The evaluation metric can be squared error or description length (e.g., M_D = (M_V + M_C) * log(n) + n * log(M_E)).
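Stages 2 through 4 and the description-length score can be sketched as follows. This is a hedged illustration rather than the IPM implementation. It assumes the metric reads M_D = (M_V + M_C) * log(n) + n * log(M_E), with M_V taken as the number of variables, M_C the number of parameters, n the number of observations, and M_E the mean squared error. The names `candidate_structures`, `best_model`, and the caller-supplied `fit` function are hypothetical.

```python
import math
from itertools import combinations


def description_length(num_variables, num_parameters, num_observations, mean_squared_error):
    """M_D = (M_V + M_C) * log(n) + n * log(M_E), under the assumed reading."""
    return ((num_variables + num_parameters) * math.log(num_observations)
            + num_observations * math.log(mean_squared_error))


def candidate_structures(instantiated_processes, max_processes):
    """Stage 2: every non-empty subset of processes up to a size limit."""
    for size in range(1, max_processes + 1):
        yield from combinations(instantiated_processes, size)


def best_model(instantiated_processes, max_processes, fit):
    """Stages 3-4: fit each structure and keep the one with the lowest score.

    `fit` is a caller-supplied (hypothetical) function that returns
    (parameters, score) for a structure; lower scores are better.
    """
    scored = ((fit(s), s) for s in candidate_structures(instantiated_processes, max_processes))
    (params, score), structure = min(scored, key=lambda item: item[0][1])
    return structure, params, score
```

Exhaustive enumeration like this grows quickly with the number of instantiated processes, which is why the size limit in stage 2 matters.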
20Estimating Parameters in Process Models
To estimate the parameters for each generic model
structure, the IPM algorithm
1. Selects random initial values that fall within ranges specified in the generic processes.
2. Improves these parameters using the Levenberg-Marquardt method until it reaches a local optimum.
3. Generates new candidate values through random jumps along dimensions of the parameter vector and continues the search.
4. If no improvement occurs after N jumps, restarts the search from a new random initial point.
This multi-level method gives reasonable fits to
time-series data from a number of domains, but it
is computationally intensive.
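The control flow of this multi-level search can be sketched in Python. To stay within the standard library, the sketch swaps in a simple shrinking-step coordinate search where the slide names the Levenberg-Marquardt method, so it shows the restart-and-jump structure rather than the actual optimizer. The function names, default settings, and seeded random generator are assumptions for the example.

```python
import random


def local_search(loss, params, bounds, step=0.1, min_step=1e-6):
    """Greedy coordinate search; a simple stand-in for Levenberg-Marquardt."""
    params = list(params)
    best = loss(params)
    while step > min_step:
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for delta in (step, -step):
                trial = list(params)
                trial[i] = min(hi, max(lo, trial[i] + delta))  # stay inside the range
                value = loss(trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2  # shrink the step once no neighbor improves
    return params, best


def estimate(loss, bounds, restarts=5, max_failed_jumps=10, rng=None):
    """Random starts, local improvement, random jumps, and restarts."""
    rng = rng or random.Random(0)
    overall = None
    for _ in range(restarts):
        # Step 1: random initial values within the specified ranges.
        start = [rng.uniform(lo, hi) for lo, hi in bounds]
        # Step 2: improve to a local optimum.
        params, best = local_search(loss, start, bounds)
        failed = 0
        # Steps 3-4: jump along one dimension until N jumps in a row fail.
        while failed < max_failed_jumps:
            jumped = list(params)
            i = rng.randrange(len(bounds))
            jumped[i] = rng.uniform(*bounds[i])
            candidate, value = local_search(loss, jumped, bounds)
            if value < best:
                params, best, failed = candidate, value, 0
            else:
                failed += 1
        if overall is None or best < overall[1]:
            overall = (params, best)
    return overall
```

In a real setting `loss` would simulate the model with the candidate parameters and return the squared error against the time series, which is where most of the computational cost arises.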
21More Issues in Process Model Induction
Inductive process modeling raises a number of
issues that have clear analogues in other
paradigms
- identifying conditions on component processes
- inferring initial values of unobservable variables
- keeping the structural search space tractable
- reducing variance to mitigate overfitting effects.
We have demonstrated promising responses to these
problems within the IPM framework.
22Evaluation of the IPM Algorithm
To demonstrate IPM's ability to induce process
models, we ran it on synthetic data for a known
system
1. We used the aquatic ecosystem model to generate data sets over 100 time steps for the variables nitro and phyto.
2. We replaced each true value x with x * (1 + r * n), where r followed a Gaussian distribution (μ = 0, σ = 1) and n > 0.
3. We ran IPM on these noisy data, giving it type constraints and generic processes as background knowledge.
In two experiments, we let IPM determine the initial values and thresholds given the correct structure; in a third study, we let it search through a space of 256 generic model structures.
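The noise model in step 2 can be written in a few lines of Python. The sketch assumes the formula reads x * (1 + r * n), with r drawn from a Gaussian with mean 0 and standard deviation 1 and n a positive noise level. The function name `add_noise`, the sample values, and the fixed seed are illustrative choices.

```python
import random


def add_noise(values, noise_level, rng=None):
    """Replace each x with x * (1 + r * n), where r ~ Normal(0, 1) and n = noise_level."""
    rng = rng or random.Random(0)
    return [x * (1 + rng.gauss(0.0, 1.0) * noise_level) for x in values]


clean = [1.0, 2.0, 4.0, 8.0]                 # made-up trajectory values
noisy = add_noise(clean, noise_level=0.05)   # 5% multiplicative noise
```

Because the noise is multiplicative, larger values receive proportionally larger perturbations, and a value of exactly zero stays zero.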
23Experimental Results with IPM
The main results of our studies with IPM on
synthetic data were
1. The system infers accurate estimates for the initial values of unobservable variables like zoo and residue.
2. The system induces estimates of condition thresholds on nitro that are close to the target values.
3. The MDL criterion selects the correct model structure in all runs with 5% noise, but in only 40% of runs with 10% noise.
These results suggest that the basic approach is sound, but that we should consider more MDL schemes and other responses to overfitting.
24Observations from the Ross Sea
25Results on Training Data from Ross Sea
26Results on Test Data from Ross Sea
27Collecting Data on Photosynthetic Processes
[Figure: data collection setup. Labels: Continuous Culture (Chemostat); External stimuli (e.g., light); Sampling mRNA/cDNA; Microarray Trace. Plot: Health of Culture versus Time, with an Equilibrium Period and an Adaptation Period marked. Image sources: www.affymetrix.com/, /wwwscience.murdoch.edu.au/teach]
28Gene Expressions for Cyanobacteria
29Generic Processes for Photosynthesis Regulation
generic process translation
  variables: P{protein}, M{mRNA}
  parameters: α [0, 1]
  equations: d[P,t,1] = α * M
generic process transcription
  variables: M{mRNA}, R{rate}
  parameters:
  equations: d[M,t,1] = R
generic process regulate_one
  variables: R{rate}, S{signal}
  parameters: γ [-1, 1]
  equations: R = γ * S
generic process regulate_two
  variables: R{rate}, S{signal}
  parameters: γ [-1, 1], δ [0, 1]
  equations: R = γ * S
             d[S,t,1] = -1 * δ * S
generic process automatic_degradation
  variables: C{concentration}
  conditions: C > 0
  parameters: α [0, 1]
  equations: d[C,t,1] = -1 * α * C
generic process controlled_degradation
  variables: D{concentration}, E{concentration}
  conditions: D > 0, E > 0
  parameters: α [0, 1]
  equations: d[D,t,1] = -1 * α * E
             d[E,t,1] = -1 * α * E
generic process photosynthesis
  variables: L{light}, P{protein}, R{redox}, S{ROS}
  parameters: α [0, 1], β [0, 1]
  equations: d[R,t,1] = α * L * P
             d[S,t,1] = β * L * P
30A Process Model for Photosynthetic Regulation
model photo_regulation
  variables: light, mRNA, protein, ROS, redox, transcription_rate
  observables: light, mRNA
  process photosynthesis
    equations: d[redox,t,1] = 0.0155 * light * protein
               d[ROS,t,1] = 0.019 * light * protein
  process protein_translation
    equations: d[protein,t,1] = 7.54 * mRNA
  process mRNA_transcription
    equations: d[mRNA,t,1] = transcription_rate
  process regulate_one_1
    equations: transcription_rate = 0.99 * light
  process regulate_two_2
    equations: transcription_rate = 1.203 * redox
               d[redox,t,1] = -0.0002 * redox
  process automatic_degradation_1
    conditions: protein > 0
    equations: d[protein,t,1] = -1.91 * protein
  process controlled_degradation_1
    conditions: redox > 0, ROS > 0
    equations: d[redox,t,1] = -0.0003 * ROS
               d[ROS,t,1] = -0.0003 * ROS
31Predictions from Best Parameterized Model
32Electric Power on the International Space Station
33Results on Battery Test Data
34Results on Data from Rinkobing Fjord
35Issue 5: Interfacing with Scientists
Because few scientists want to be replaced, we are developing an interactive environment that lets users
- specify a quantitative process model of the target system
- display and edit the model's structure and details graphically
- simulate the model's behavior over time and situations
- compare the model's predicted behavior to observations
- invoke a revision module in response to detected anomalies.
The environment offers computational assistance
in forming and evaluating models but lets the
user retain control.
36Viewing and Editing a Process Model
37Results of Revising the NPP Model
Initial model:
  E = 0.56 * T1 * T2 * W
  T2 = 1.18 / [(1 + e^(0.2 * (Topt - Tempc - 10))) * (1 + e^(0.3 * (Tempc - Topt - 10)))]
  PET = 1.6 * (10 * Tempc / AHI)^A * PET-TW-M
  SR ∈ {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05}
  RMSE on training data = 465.212 and r^2 = 0.799
Revised model:
  E = 0.353 * T1^0.00 * T2^0.08 * W^0.00
  T2 = 0.83 / [(1 + e^(1.0 * (Topt - Tempc - 6.34))) * (1 + e^(1.0 * (Tempc - Topt - 11.52)))]
  PET = 1.6 * (10 * Tempc / AHI)^A * PET-TW-M
  SR ∈ {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61}
  Cross-validated RMSE = 397.306 and r^2 = 0.853 (15% reduction)
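The reported improvement can be checked with simple arithmetic, assuming the "15 reduction" on the slide refers to a 15% drop in RMSE. The Python sketch below defines RMSE in the usual way and computes the relative change between the two reported values. The helper names are illustrative, and note that the slide compares a training-set RMSE with a cross-validated one.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Values reported on the slide: initial model versus revised model.
reduction = relative_reduction(465.212, 397.306)   # about 0.146, which rounds to 15%
```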
38Intellectual Influences
Our approach to computational discovery
incorporates ideas from many traditions
- computational scientific discovery (e.g., Langley et al., 1983)
- theory revision in machine learning (e.g., Towell, 1991)
- qualitative physics and simulation (e.g., Forbus, 1984)
- languages for scientific simulation (e.g., STELLA, MATLAB)
- interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from
machine learning, AI, programming languages, and
human-computer interaction.
39Contributions of the Research
In summary, our work on computational scientific
discovery has, in responding to various
challenges, produced
- a new formalism for representing scientific process models
- a computational method for simulating these models' behavior
- an encoding for background knowledge as generic processes
- an algorithm for inducing process models from time-series data
- an interactive environment for model construction/utilization.
We have demonstrated this approach to model
creation on domains from Earth science,
microbiology, and engineering.
40Directions for Future Research
Despite our progress to date, we need further
work in order to
- produce additional results on other scientific data sets
- develop improved methods for fitting model parameters
- extend the approach to handle data sets with missing values
- implement heuristic methods for searching the structure space
- utilize knowledge of subsystems to further constrain search
- augment the modeling environment to make it more usable.
Inductive process modeling has great potential to
speed progress in systems science and
engineering.