Computational Discovery of Communicable Knowledge


Slides: 42

Provided by: Lang8

Learn more at: http://www.isle.org


Transcript and Presenter's Notes

Title: Computational Discovery of Communicable Knowledge

1
Computational Discovery of Explanatory Process
Models
Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/langley
langley_at_csli.stanford.edu
Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, A.
Pohorille, J. Sanchez, K. Saito, and J. Shrager
for their contributions to this research.
2
Data Mining vs. Scientific Discovery
There exist two computational paradigms for
discovering explicit knowledge from data. The
data mining movement develops computational
methods that

induce predictive models from large, often
business, data sets
cast models as decision trees, logical rules, or
other notations invented by AI researchers.

In contrast, computational scientific discovery
focuses on

constructing models from (often small) scientific
data sets
stated in formalisms invented by scientists and
engineers.

Both approaches draw on heuristic search to find
regularities in data, but they differ
considerably in their emphases.
3
In Memoriam
Three years ago, computational scientific
discovery lost two of its founding fathers

Herbert A. Simon (1916-2001)
Jan M. Zytkow (1945-2001)

Both contributed to the field in many ways:
posing new problems, inventing methods, training
students, and organizing meetings. Moreover, both
were interdisciplinary researchers who
contributed to computer science, psychology,
philosophy, and statistics. Herb Simon and Jan
Zytkow were excellent role models whom we should
all aim to emulate.
4
Time Line for Research on Computational
Scientific Discovery
[Figure: time line running from 1979 to 2000 that places discovery systems by year. Systems shown: AM; Dendral; Bacon.1 to Bacon.5; Glauber; NGlauber; Dalton, Stahl; Abacus, Coper; Gell-Mann; BR-3, Mendel; Pauli; Stahlp, Revolver; Fahrenheit, E, Tetrad, IDSN; IDSQ, Live; Hume, ARC; BR-4; IE; Coast, Phineas, AbE, Kekada; DST, GPN; LaGrange; Mechem, CDP; SDS; SSF, RF5, LaGrange; Astra, GPM; RL, Progol; HR]
5
Successes of Computational Scientific Discovery
Over the past decade, systems of this type have
helped discover new knowledge in many scientific
fields

qualitative chemical factors in mutagenesis (King
et al., 1996)
quantitative laws of metallic behavior (Sleeman
et al., 1997)
qualitative conjectures in number theory (Colton
et al., 2000)
temporal laws of ecological behavior (Todorovski
et al., 2000)
reaction pathways in catalytic chemistry
(Valdes-Perez, 1994)

Each has led to publications in the refereed
scientific literature (e.g., Langley, 2000), but
they did not focus on systems science.
6
The Nature of Systems Science
Disciplines like Earth science and computational
biology differ from traditional fields in that
they

focus on synthesis rather than analysis in their
operation
rely on computer modeling as one of their central
methods
develop system-level models with many variables
and relations
evaluate their models on observational, not
experimental, data.

Developing and testing such models are complex
tasks that would benefit from computational aids.
However, existing methods for computational
scientific discovery were not designed with
systems science in mind.
7
Observations from the Ross Sea
8
Inductive Process Modeling
Our response is to design, construct, and
evaluate computational methods for inductive
process modeling, which

represent scientific models as sets of
quantitative processes
use these models to predict and explain
observational data
search a space of process models to find good
candidates
utilize background knowledge to constrain this
search.

This framework has great potential for aiding
systems science, but it raises new computational
challenges.
9
Challenges of Inductive Process Modeling
Process model induction differs from typical
learning tasks in that

process models characterize behavior of dynamical
systems
variables are continuous but can have
discontinuous behavior
observations are not independently and
identically distributed
models may contain unobservable processes and
variables
multiple processes can interact to produce
complex behavior.

Compensating factors include a focus on
deterministic systems and the availability of
background knowledge.
10
Issue 1: Representing Scientific Models
To assist systems scientists' modeling efforts, we
must first encode candidate models that

address observational rather than experimental
data
deal with dynamic systems that change over time
have an explanatory rather than a descriptive
character
are causal in that they describe chains of
effects
contain quantitative relations and qualitative
structure.

We need some formal way to represent such models
that can be interpreted computationally.
11
Why Are Existing Formalisms Inadequate?
12
A Process Model for an Aquatic Ecosystem
model Ross_Sea_Ecosystem
  variables: phyto, nitro, residue, light, growth_rate, effective_light, ice_factor
  observables: phyto, nitro, light, ice_factor
  process phyto_loss
    equations: d[phyto,t,1] = -0.1 * phyto
               d[residue,t,1] = 0.1 * phyto
  process phyto_growth
    equations: d[phyto,t,1] = growth_rate * phyto
  process phyto_uptakes_nitro
    conditions: nitro > 0
    equations: d[nitro,t,1] = -1 * 0.204 * growth_rate * phyto
  process growth_limitation
    equations: growth_rate = 0.23 * min(nitrate_rate, light_rate)
  process nitrate_availability
    equations: nitrate_rate = nitrate / (nitrate + 5)
  process light_availability
    equations: light_rate = effective_light / (effective_light + 50)
  process light_attenuation
    equations: effective_light = light * ice_factor
13
Advantages of Quantitative Process Models
Process models are a good target for discovery
systems because

they embed quantitative relations within
qualitative structure
that refer to notations and mechanisms familiar
to scientists
they provide dynamical predictions of changes
over time
they offer causal and explanatory accounts of
phenomena
while retaining the modularity needed to support
induction.

Quantitative process models provide an important
alternative to formalisms used currently in
computational discovery.
14
Issue 2: Generating Predictions and Explanations
To utilize or evaluate a given process model, we
must simulate its behavior over time

specify initial values for input variables and
time step size
on each time step, determine which processes are
active
solve active algebraic/differential equations
with known values
propagate values and recursively solve other
active equations
when multiple processes influence the same
variable, assume their effects are additive.

This performance method makes specific
predictions that we can compare to observations.
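The simulation steps above can be sketched in Python for the Ross Sea model from slide 12. This is a minimal illustration, not the authors' simulator. It assumes a forward-Euler update with step size `dt`, reads the operators in the slide's equations as shown in the comments, uses `nitro` where the slide writes `nitrate`, and holds `light` and `ice_factor` constant. The function name `simulate` and all initial values are hypothetical.

```python
def simulate(phyto, nitro, residue, light, ice_factor, steps, dt=1.0):
    """Forward-Euler simulation of the Ross Sea process model (illustrative)."""
    trajectory = [(phyto, nitro, residue)]
    for _ in range(steps):
        # Algebraic equations are solved first, in dependency order.
        effective_light = light * ice_factor                   # light_attenuation
        light_rate = effective_light / (effective_light + 50)  # light_availability
        nitrate_rate = nitro / (nitro + 5)                     # nitrate_availability
        growth_rate = 0.23 * min(nitrate_rate, light_rate)     # growth_limitation

        # Differential equations: effects on the same variable are additive.
        d_phyto = -0.1 * phyto              # phyto_loss
        d_residue = 0.1 * phyto             # phyto_loss
        d_phyto += growth_rate * phyto      # phyto_growth
        d_nitro = 0.0
        if nitro > 0:                       # condition on phyto_uptakes_nitro
            d_nitro += -1 * 0.204 * growth_rate * phyto

        phyto += dt * d_phyto
        nitro += dt * d_nitro
        residue += dt * d_residue
        trajectory.append((phyto, nitro, residue))
    return trajectory


# Hypothetical initial values, chosen only to exercise the loop.
trajectory = simulate(phyto=1.0, nitro=10.0, residue=0.0,
                      light=100.0, ice_factor=1.0, steps=5)
```

Each tuple in `trajectory` holds the predicted `phyto`, `nitro`, and `residue` values at one time step, which is the form needed for comparison against observations.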
15
Issue 3: Encoding Background Knowledge
To constrain candidate models, we can utilize
available background knowledge about the domain.
Previous work has encoded background knowledge in
terms of

Horn clause programs (e.g., Towell & Shavlik,
1990)
context-free grammars (e.g., Dzeroski &
Todorovski, 1997)
prior probability distributions (e.g., Friedman
et al., 2000)

However, none of these notations are familiar to
domain scientists, which suggests the need for
another approach.
16
Generic Processes as Background Knowledge
Our framework casts background knowledge as
generic processes that specify

the variables involved in a process and their
types
the parameters appearing in a process and their
ranges
the forms of conditions on the process and
the forms of associated equations and their
parameters.

Generic processes are building blocks from which
one can compose a specific process model.
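One way to picture a generic process as a data structure, and the type-constrained binding of its variables, is the Python sketch below. It is an assumption-laden illustration, not the authors' encoding: the class `GenericProcess`, the function `instantiate`, the parameter name `alpha`, the equation strings, and the `domain` dictionary are all hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class GenericProcess:
    name: str
    variables: tuple          # (placeholder, type) pairs, e.g. ("S", "species")
    parameters: dict          # parameter name -> (low, high) range
    conditions: tuple = ()    # condition forms, kept as strings here
    equations: tuple = ()     # equation forms, kept as strings here


def instantiate(generic, typed_variables):
    """Return every binding of placeholders to distinct variables of matching type."""
    bindings = []
    for combo in permutations(typed_variables, len(generic.variables)):
        pairs = list(zip(combo, generic.variables))
        if all(typed_variables[var] == vtype for var, (_, vtype) in pairs):
            bindings.append({placeholder: var for var, (placeholder, _) in pairs})
    return bindings


# Hypothetical encoding of one generic process and one typed domain.
exponential_loss = GenericProcess(
    name="exponential_loss",
    variables=(("S", "species"), ("D", "detritus")),
    parameters={"alpha": (0.0, 1.0)},
    equations=("d[S,t,1] = -1 * alpha * S", "d[D,t,1] = alpha * S"),
)
domain = {"phyto": "species", "zoo": "species",
          "residue": "detritus", "nitro": "nutrient"}
bindings = instantiate(exponential_loss, domain)
```

Here `bindings` lists the two ways to bind `exponential_loss` in this domain: one with `phyto` as the species and one with `zoo`, both draining into `residue`.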
17
Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 * α * S
             d[D,t,1] = α * S

generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π * D
             d[D,t,1] = -1 * π * D

generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ * ρ * S1
             d[D,t,1] = (1 - γ) * ρ * S1
             d[S2,t,1] = -1 * ρ * S1

generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν

generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ * S
             d[N,t,1] = -1 * β * μ * S
18
Issue 4: Inducing Process Models
[Diagram: training data and generic processes are the inputs to Induction, which outputs a process model]
19
A Method for Process Model Induction
We have implemented the IPM algorithm, which
induces process models from generic components in
four stages:
1. Find all ways to instantiate known generic
processes with specific variables, subject to
type constraints.
2. Combine instantiated processes into candidate
generic models, subject to additional constraints
(e.g., number of processes).
3. For each generic model, carry out search
through parameter space to find good
coefficients.
4. Return the parameterized model with the best
overall score.
The evaluation metric can be squared error or
description length (e.g., $M_D = (M_V + M_C) \cdot \log(n) + n \cdot \log(M_E)$).
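The description-length metric can be sketched in Python as follows. The slide does not define its symbols, so this assumes the formula reads M_D = (M_V + M_C) * log(n) + n * log(M_E), with M_V the number of variables, M_C the number of parameters, n the number of observations, and M_E the mean squared error. The function name and the candidate scores are hypothetical.

```python
import math


def description_length(n_variables, n_parameters, n_observations, mean_squared_error):
    """Assumed reading: M_D = (M_V + M_C) * log(n) + n * log(M_E)."""
    complexity = (n_variables + n_parameters) * math.log(n_observations)
    misfit = n_observations * math.log(mean_squared_error)
    return complexity + misfit


# Hypothetical candidates: (label, variables, parameters, observations, MSE).
candidates = [
    ("small model", 3, 2, 100, 0.50),
    ("large model", 6, 9, 100, 0.45),
]
best = min(candidates, key=lambda c: description_length(*c[1:]))
```

In this made-up comparison the smaller model wins: the larger model's slightly lower error does not pay for its ten extra terms, which is the trade-off the metric is meant to capture.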
20
Estimating Parameters in Process Models
To estimate the parameters for each generic model
structure, the IPM algorithm
1. Selects random initial values that fall within
the ranges specified in the generic processes.
2. Improves these parameters using the
Levenberg-Marquardt method until it reaches a
local optimum.
3. Generates new candidate values through random
jumps along dimensions of the parameter vector
and continues the search.
4. If no improvement occurs after N jumps,
restarts the search from a new random initial
point.
This multi-level method gives reasonable fits to
time-series data from a number of domains, but it
is computationally intensive.
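The control structure of this multi-level search can be sketched in Python as below. To stay free of external libraries, the local improvement step is a simple coordinate pattern search swapped in for Levenberg-Marquardt, so this is not the IPM implementation. The names `local_descent` and `fit_parameters`, the default settings, and the toy loss are all hypothetical.

```python
import random


def local_descent(loss, params, ranges, step=0.1, tol=1e-6):
    """Coordinate pattern search: a stand-in for Levenberg-Marquardt."""
    params = list(params)
    best = loss(params)
    while step > tol:
        improved = False
        for i, (low, high) in enumerate(ranges):
            for delta in (step, -step):
                trial = list(params)
                trial[i] = min(high, max(low, trial[i] + delta))
                value = loss(trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2          # refine the step once no move helps
    return params


def fit_parameters(loss, ranges, n_jumps=20, n_restarts=5, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_restarts):
        # Step 1: random initial values within the allowed ranges.
        start = [rng.uniform(low, high) for low, high in ranges]
        # Step 2: improve to a local optimum.
        current = local_descent(loss, start, ranges)
        failures = 0
        # Step 3: random jumps along one dimension, then continue the search.
        while failures < n_jumps:
            candidate = list(current)
            i = rng.randrange(len(ranges))
            candidate[i] = rng.uniform(*ranges[i])
            candidate = local_descent(loss, candidate, ranges)
            if loss(candidate) < loss(current) - 1e-12:
                current, failures = candidate, 0
            else:
                failures += 1
        # Step 4: after n_jumps failures, the outer loop restarts the search.
        if loss(current) < best_loss:
            best_params, best_loss = current, loss(current)
    return best_params, best_loss


# Toy loss with a known optimum at (0.3, 0.7), for illustration only.
params, value = fit_parameters(
    lambda p: (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2,
    ranges=[(0.0, 1.0), (0.0, 1.0)],
)
```

In the setting the slide describes, the loss would come from simulating a candidate model with the given parameters and comparing its predictions to the time series. That repeated simulation is a plausible source of the computational cost the slide mentions.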
21
More Issues in Process Model Induction
Inductive process modeling raises a number of
issues that have clear analogues in other
paradigms

identifying conditions on component processes
inferring initial values of unobservable
variables
keeping the structural search space tractable
reducing variance to mitigate overfitting effects

We have demonstrated promising responses to these
problems within the IPM framework.
22
Evaluation of the IPM Algorithm
To demonstrate IPM's ability to induce process
models, we ran it on synthetic data for a known
system
1. We used the aquatic ecosystem model to
generate data sets over 100 time steps for the
variables nitro and phyto.
2. We replaced each true value x with
x * (1 + r * n), where r followed a Gaussian
distribution (μ = 0, σ = 1) and n > 0.
3. We ran IPM on these noisy data, giving it type
constraints and generic processes as background
knowledge.
In two experiments, we let IPM determine the
initial values and thresholds given the correct
structure; in a third study, we let it search
through a space of 256 generic model structures.
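The noise step can be sketched in Python, assuming the replacement rule reads x * (1 + r * n) with r drawn from a standard Gaussian. The function name `add_noise`, the `seed` argument, and the sample values are hypothetical.

```python
import random


def add_noise(values, n, seed=None):
    """Replace each x with x * (1 + r * n), where r ~ Gaussian(mean 0, std 1)."""
    rng = random.Random(seed)
    return [x * (1 + rng.gauss(0.0, 1.0) * n) for x in values]


# Hypothetical clean series, perturbed at one noise level.
clean = [1.0, 2.0, 4.0, 8.0]
noisy = add_noise(clean, n=0.05, seed=1)
```

Because the noise is multiplicative, larger true values receive proportionally larger perturbations. Under this reading, n = 0.05 and n = 0.10 would correspond to noise levels of 5 and 10 percent.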
23
Experimental Results with IPM
The main results of our studies with IPM on
synthetic data were
1. The system infers accurate estimates for the
initial values of unobservable variables like zoo
and residue.
2. The system induces estimates of condition
thresholds on nitro that are close to the target
values.
3. The MDL criterion selects the correct model
structure in all runs with 5% noise, but in only
40% of runs with 10% noise.
These results suggest that the basic approach is
sound, but that we should consider more MDL
schemes and other responses to overfitting.
24
Observations from the Ross Sea
25
Results on Training Data from Ross Sea
26
Results on Test Data from Ross Sea
27
Collecting Data on Photosynthetic Processes
[Figure: data collection setup. Labels: Continuous Culture (Chemostat); External stimuli (e.g., light); Adaptation Period; Equilibrium Period; Health of Culture; Sampling mRNA/cDNA; Microarray Trace; axis: Time. Image sources: www.affymetrix.com/, wwwscience.murdoch.edu.au/teach]
28
Gene Expressions for Cyanobacteria
29
Generic Processes for Photosynthesis Regulation
generic process translation
  variables: P{protein}, M{mRNA}
  parameters: α [0, 1]
  equations: d[P,t,1] = α * M

generic process transcription
  variables: M{mRNA}, R{rate}
  parameters:
  equations: d[M,t,1] = R

generic process regulate_one
  variables: R{rate}, S{signal}
  parameters: γ [-1, 1]
  equations: R = γ * S

generic process regulate_two
  variables: R{rate}, S{signal}
  parameters: γ [-1, 1], δ [0, 1]
  equations: R = γ * S
             d[S,t,1] = -1 * δ * S

generic process automatic_degradation
  variables: C{concentration}
  conditions: C > 0
  parameters: δ [0, 1]
  equations: d[C,t,1] = -1 * δ * C

generic process controlled_degradation
  variables: D{concentration}, E{concentration}
  conditions: D > 0, E > 0
  parameters: δ [0, 1]
  equations: d[D,t,1] = -1 * δ * E
             d[E,t,1] = -1 * δ * E

generic process photosynthesis
  variables: L{light}, P{protein}, R{redox}, S{ROS}
  parameters: α [0, 1], β [0, 1]
  equations: d[R,t,1] = α * L * P
             d[S,t,1] = β * L * P
30
A Process Model for Photosynthetic Regulation
model photo_regulation
  variables: light, mRNA, protein, ROS, redox, transcription_rate
  observables: light, mRNA
  process photosynthesis
    equations: d[redox,t,1] = 0.0155 * light * protein
               d[ROS,t,1] = 0.019 * light * protein
  process protein_translation
    equations: d[protein,t,1] = 7.54 * mRNA
  process mRNA_transcription
    equations: d[mRNA,t,1] = transcription_rate
  process regulate_one_1
    equations: transcription_rate = 0.99 * light
  process regulate_two_2
    equations: transcription_rate = 1.203 * redox
               d[redox,t,1] = -0.0002 * redox
  process automatic_degradation_1
    conditions: protein > 0
    equations: d[protein,t,1] = -1.91 * protein
  process controlled_degradation_1
    conditions: redox > 0, ROS > 0
    equations: d[redox,t,1] = -0.0003 * ROS
               d[ROS,t,1] = -0.0003 * ROS
31
Predictions from Best Parameterized Model
32
Electric Power on the International Space Station
33
Results on Battery Test Data
34
Results on Data from Ringkøbing Fjord
35
Issue 5: Interfacing with Scientists
Because few scientists want to be replaced, we
are developing an interactive environment that
lets users

specify a quantitative process model of the
target system
display and edit the model's structure and
details graphically
simulate the model's behavior over time and
situations
compare the model's predicted behavior to
observations
invoke a revision module in response to detected
anomalies.

The environment offers computational assistance
in forming and evaluating models but lets the
user retain control.
36
Viewing and Editing a Process Model
37
Results of Revising the NPP Model
Initial model:
$E = 0.56 \cdot T1 \cdot T2 \cdot W$
$T2 = 1.18 / [(1 + e^{0.2 \cdot (Topt - Tempc - 10)}) \cdot (1 + e^{0.3 \cdot (Tempc - Topt - 10)})]$
$PET = 1.6 \cdot (10 \cdot Tempc / AHI)^{A} \cdot \text{PET-TW-M}$
$SR \in \{3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05\}$
RMSE on training data = 465.212 and $r^2 = 0.799$

Revised model:
$E = 0.353 \cdot T1^{0.00} \cdot T2^{0.08} \cdot W^{0.00}$
$T2 = 0.83 / [(1 + e^{1.0 \cdot (Topt - Tempc - 6.34)}) \cdot (1 + e^{1.0 \cdot (Tempc - Topt - 11.52)})]$
$PET = 1.6 \cdot (10 \cdot Tempc / AHI)^{A} \cdot \text{PET-TW-M}$
$SR \in \{0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61\}$
Cross-validated RMSE = 397.306 and $r^2 = 0.853$ (15% reduction)
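The reported reduction can be checked with a line of arithmetic. This assumes the figure compares the revised model's cross-validated RMSE with the initial model's training RMSE, which the slide does not state explicitly.

```latex
\frac{465.212 - 397.306}{465.212} = \frac{67.906}{465.212} \approx 0.146
```

That is roughly a 14.6 percent drop in RMSE, consistent with the rounded figure of 15 on the slide.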

38
Intellectual Influences
Our approach to computational discovery
incorporates ideas from many traditions

computational scientific discovery (e.g., Langley
et al., 1983)
theory revision in machine learning (e.g.,
Towell, 1991)
qualitative physics and simulation (e.g., Forbus,
1984)
languages for scientific simulation (e.g.,
STELLA, MATLAB)
interactive tools for data analysis (e.g.,
Shneiderman, 2001).

Our work combines, in novel ways, insights from
machine learning, AI, programming languages, and
human-computer interaction.
39
Contributions of the Research
In summary, our work on computational scientific
discovery has, in responding to various
challenges, produced

a new formalism for representing scientific
process models
a computational method for simulating these
models' behavior
an encoding for background knowledge as generic
processes
an algorithm for inducing process models from
time-series data
an interactive environment for model
construction/utilization.

We have demonstrated this approach to model
creation on domains from Earth science,
microbiology, and engineering.
40
Directions for Future Research
Despite our progress to date, we need further
work in order to

produce additional results on other scientific
data sets
develop improved methods for fitting model
parameters
extend the approach to handle data sets with
missing values
implement heuristic methods for searching the
structure space
utilize knowledge of subsystems to further
constrain search
augment the modeling environment to make it more
usable

Inductive process modeling has great potential to
speed progress in systems science and
engineering.
41
End of Presentation
