Computational Discovery of Communicable Knowledge

Slides: 42
Provided by: Lang8
Learn more at: http://www.isle.org
1
Computational Discovery of Explanatory Process
Models
Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/langley
langley_at_csli.stanford.edu
Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, A.
Pohorille, J. Sanchez, K. Saito, and J. Shrager
for their contributions to this research.
2
Data Mining vs. Scientific Discovery
There exist two computational paradigms for
discovering explicit knowledge from data. The
data mining movement develops computational
methods that
  • induce predictive models from large, often
    business, data sets
  • cast models as decision trees, logical rules, or
    other notations invented by AI researchers.

In contrast, computational scientific discovery
focuses on
  • constructing models from (often small) scientific
    data sets
  • stated in formalisms invented by scientists and
    engineers.

Both approaches draw on heuristic search to find
regularities in data, but they differ
considerably in their emphases.
3
In Memoriam
Three years ago, computational scientific
discovery lost two of its founding fathers:
  • Herbert A. Simon (1916–2001)
  • Jan M. Zytkow (1945–2001)

Both contributed to the field in many ways:
posing new problems, inventing methods, training
students, and organizing meetings. Moreover, both
were interdisciplinary researchers who
contributed to computer science, psychology,
philosophy, and statistics. Herb Simon and Jan
Zytkow were excellent role models whom we should
all aim to emulate.
4
Time Line for Research on Computational
Scientific Discovery
[Figure: time line from 1979 to 2000 placing
discovery systems by year, with a legend. Systems
shown: Bacon.1–Bacon.5, AM, Dendral, Glauber,
NGlauber, Dalton, Stahl, Stahlp, Revolver,
Gell-Mann, BR-3, BR-4, Mendel, Pauli, Abacus,
Coper, Fahrenheit, E, Tetrad, IDSN, IDSQ, Live,
IE, Hume, ARC, Coast, Phineas, AbE, Kekada, DST,
GPN, GPM, Mechem, CDP, Astra, Lagrange, Lagramge,
SDS, SSF, RF5, RL, Progol, HR.]
5
Successes of Computational Scientific Discovery
Over the past decade, systems of this type have
helped discover new knowledge in many scientific
fields:
  • qualitative chemical factors in mutagenesis (King
    et al., 1996)
  • quantitative laws of metallic behavior (Sleeman
    et al., 1997)
  • qualitative conjectures in number theory (Colton
    et al., 2000)
  • temporal laws of ecological behavior (Todorovski
    et al., 2000)
  • reaction pathways in catalytic chemistry
    (Valdes-Perez, 1994)

Each has led to publications in the refereed
scientific literature (e.g., Langley, 2000), but
they did not focus on systems science.
6
The Nature of Systems Science
Disciplines like Earth science and computational
biology differ from traditional fields in that
they
  • focus on synthesis rather than analysis in their
    operation
  • rely on computer modeling as one of their central
    methods
  • develop system-level models with many variables
    and relations
  • evaluate their models on observational, not
    experimental, data.

Developing and testing such models are complex
tasks that would benefit from computational aids.
However, existing methods for computational
scientific discovery were not designed with
systems science in mind.
7
Observations from the Ross Sea
8
Inductive Process Modeling
Our response is to design, construct, and
evaluate computational methods for inductive
process modeling, which
  • represent scientific models as sets of
    quantitative processes
  • use these models to predict and explain
    observational data
  • search a space of process models to find good
    candidates
  • utilize background knowledge to constrain this
    search.

This framework has great potential for aiding
systems science, but it raises new computational
challenges.
9
Challenges of Inductive Process Modeling
Process model induction differs from typical
learning tasks in that
  • process models characterize behavior of dynamical
    systems
  • variables are continuous but can have
    discontinuous behavior
  • observations are not independently and
    identically distributed
  • models may contain unobservable processes and
    variables
  • multiple processes can interact to produce
    complex behavior.

Compensating factors include a focus on
deterministic systems and the availability of
background knowledge.
10
Issue 1: Representing Scientific Models
To assist systems scientists' modeling efforts, we
must first encode candidate models that
  • address observational rather than experimental
    data
  • deal with dynamic systems that change over time
  • have an explanatory rather than a descriptive
    character
  • are causal in that they describe chains of
    effects
  • contain quantitative relations and qualitative
    structure.

We need some formal way to represent such models
that can be interpreted computationally.
11
Why Are Existing Formalisms Inadequate?
12
A Process Model for an Aquatic Ecosystem
model Ross_Sea_Ecosystem
  variables: phyto, nitro, residue, light, growth_rate,
             effective_light, ice_factor
  observables: phyto, nitro, light, ice_factor

  process phyto_loss
    equations: d[phyto,t,1] = -0.1 × phyto
               d[residue,t,1] = 0.1 × phyto
  process phyto_growth
    equations: d[phyto,t,1] = growth_rate × phyto
  process phyto_uptakes_nitro
    conditions: nitro > 0
    equations: d[nitro,t,1] = -1 × 0.204 × growth_rate × phyto
  process growth_limitation
    equations: growth_rate = 0.23 × min(nitrate_rate, light_rate)
  process nitrate_availability
    equations: nitrate_rate = nitrate / (nitrate + 5)
  process light_availability
    equations: light_rate = effective_light / (effective_light + 50)
  process light_attenuation
    equations: effective_light = light × ice_factor
13
Advantages of Quantitative Process Models
Process models are a good target for discovery
systems because
  • they embed quantitative relations within
    qualitative structure
  • that refer to notations and mechanisms familiar
    to scientists
  • they provide dynamical predictions of changes
    over time
  • they offer causal and explanatory accounts of
    phenomena
  • while retaining the modularity needed to support
    induction.

Quantitative process models provide an important
alternative to formalisms used currently in
computational discovery.
14
Issue 2: Generating Predictions and Explanations
To utilize or evaluate a given process model, we
must simulate its behavior over time:
  • specify initial values for input variables and
    time step size
  • on each time step, determine which processes are
    active
  • solve active algebraic/differential equations
    with known values
  • propagate values and recursively solve other
    active equations
  • when multiple processes influence the same
    variable, assume their effects are additive.

This performance method makes specific
predictions that we can compare to observations.
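The simulation steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' simulator: it uses forward Euler steps, hard-codes the Ross Sea model from the earlier slide, treats that slide's `nitrate` and `nitro` as the same quantity, and uses made-up initial values and constant `light` and `ice_factor` inputs.

```python
def simulate(steps=50, dt=1.0, phyto=1.0, nitro=20.0, residue=0.0,
             light=100.0, ice_factor=0.5):
    """Forward-Euler simulation of the Ross Sea process model (illustrative).

    All default values are hypothetical; they are not taken from the slides.
    """
    trajectory = [(phyto, nitro, residue)]
    for _ in range(steps):
        # Algebraic equations are solved first, in dependency order.
        effective_light = light * ice_factor                    # light_attenuation
        light_rate = effective_light / (effective_light + 50)   # light_availability
        nitrate_rate = nitro / (nitro + 5)                      # nitrate_availability
        growth_rate = 0.23 * min(nitrate_rate, light_rate)      # growth_limitation

        # Differential equations: effects of active processes are summed.
        d_phyto = -0.1 * phyto              # phyto_loss
        d_residue = 0.1 * phyto             # phyto_loss
        d_phyto += growth_rate * phyto      # phyto_growth
        d_nitro = 0.0
        if nitro > 0:                       # condition on phyto_uptakes_nitro
            d_nitro += -1 * 0.204 * growth_rate * phyto

        phyto += dt * d_phyto
        nitro = max(nitro + dt * d_nitro, 0.0)  # assumption: clamp at zero
        residue += dt * d_residue
        trajectory.append((phyto, nitro, residue))
    return trajectory


traj = simulate()
```

Each entry of `traj` is a `(phyto, nitro, residue)` triple that could be compared against observations at the same time step.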
15
Issue 3: Encoding Background Knowledge
To constrain candidate models, we can utilize
available background knowledge about the domain.
Previous work has encoded background knowledge in
terms of
  • Horn clause programs (e.g., Towell & Shavlik,
    1990)
  • context-free grammars (e.g., Dzeroski &
    Todorovski, 1997)
  • prior probability distributions (e.g., Friedman
    et al., 2000)

However, none of these notations are familiar to
domain scientists, which suggests the need for
another approach.
16
Generic Processes as Background Knowledge
Our framework casts background knowledge as
generic processes that specify
  • the variables involved in a process and their
    types
  • the parameters appearing in a process and their
    ranges
  • the forms of conditions on the process and
  • the forms of associated equations and their
    parameters.

Generic processes are building blocks from which
one can compose a specific process model.
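One way to picture generic processes as building blocks is as typed templates whose variable slots are bound to concrete variables. The sketch below is an illustration only: the class `GenericProcess`, the function `instantiate`, the parameter names (`rate`, `share`) and the variable `zoo` with its type are assumptions, not the representation used in the work described here.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class GenericProcess:
    name: str
    variable_types: tuple    # required type of each variable slot, in order
    parameter_ranges: dict   # hypothetical parameter name -> (low, high)


def instantiate(generic, typed_variables):
    """Return every binding of concrete variables to the slots of `generic`
    that respects the declared types."""
    names = list(typed_variables)
    bindings = []
    for combo in permutations(names, len(generic.variable_types)):
        if all(typed_variables[v] == t
               for v, t in zip(combo, generic.variable_types)):
            bindings.append((generic.name, combo))
    return bindings


# Two generic processes in the style of the aquatic-ecosystem library.
exponential_loss = GenericProcess(
    "exponential_loss", ("species", "detritus"), {"rate": (0.0, 1.0)})
grazing = GenericProcess(
    "grazing", ("species", "species", "detritus"),
    {"rate": (0.0, 1.0), "share": (0.0, 1.0)})

# Hypothetical typed variables for a small ecosystem.
variables = {"phyto": "species", "zoo": "species", "residue": "detritus"}

loss_bindings = instantiate(exponential_loss, variables)
grazing_bindings = instantiate(grazing, variables)
```

With these three variables, the type constraints leave two bindings for each generic process, which shows how typing prunes the space of candidate processes.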
17
Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 × α × S
             d[D,t,1] = α × S

generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π × D
             d[D,t,1] = -1 × π × D

generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ × ρ × S1
             d[D,t,1] = (1 - γ) × ρ × S1
             d[S2,t,1] = -1 × ρ × S1

generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν

generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ × S
             d[N,t,1] = -1 × β × μ × S
18
Issue 4: Inducing Process Models
[Diagram: training data and generic processes are
the inputs to Induction, which produces a process
model.]
19
A Method for Process Model Induction
We have implemented the IPM algorithm, which
induces process models from generic components in
four stages:
1. Find all ways to instantiate known generic
   processes with specific variables, subject to
   type constraints.
2. Combine instantiated processes into candidate
   generic models, subject to additional
   constraints (e.g., number of processes).
3. For each generic model, carry out search
   through parameter space to find good
   coefficients.
4. Return the parameterized model with the best
   overall score.
The evaluation metric can be squared error or
description length, e.g.,
M_D = (M_V + M_C) × log(n) + n × log(M_E).
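The structure-combination stage and the description-length score can be sketched as follows. This is a rough illustration under assumptions: it reads the score as M_D = (M_V + M_C) × log(n) + n × log(M_E), takes M_V as the number of variables, M_C as the number of constants and M_E as the mean squared error, uses natural logarithms, and enumerates subsets exhaustively. The function names and all numbers are hypothetical.

```python
import math
from itertools import combinations


def description_length(n_variables, n_constants, n_obs, mean_sq_error):
    """M_D = (M_V + M_C) * log(n) + n * log(M_E), under the reading above."""
    return ((n_variables + n_constants) * math.log(n_obs)
            + n_obs * math.log(mean_sq_error))


def candidate_structures(instantiated, max_processes):
    """Enumerate every non-empty subset of instantiated processes
    containing at most `max_processes` processes."""
    for k in range(1, max_processes + 1):
        yield from combinations(instantiated, k)


processes = ["phyto_loss", "phyto_growth", "phyto_uptakes_nitro"]
structures = list(candidate_structures(processes, max_processes=2))

# Made-up numbers: a small model versus a larger one that fits slightly better.
simple = description_length(n_variables=3, n_constants=2,
                            n_obs=100, mean_sq_error=0.50)
larger = description_length(n_variables=3, n_constants=6,
                            n_obs=100, mean_sq_error=0.49)
```

In this made-up comparison the smaller model receives the lower score, because the slight gain in fit does not pay for four extra constants.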
20
Estimating Parameters in Process Models
To estimate the parameters for each generic model
structure, the IPM algorithm
1. Selects random initial values that fall within
   ranges specified in the generic processes.
2. Improves these parameters using the
   Levenberg-Marquardt method until it reaches a
   local optimum.
3. Generates new candidate values through random
   jumps along dimensions of the parameter vector
   and continues the search.
4. If no improvement occurs after N jumps,
   restarts the search from a new random initial
   point.
This multi-level method gives reasonable fits to
time-series data from a number of domains, but it
is computationally intensive.
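The control structure of this multi-level search can be sketched for a single parameter. To keep the example dependency-free, the sketch swaps in a simple shrinking-step local search where the slide names Levenberg-Marquardt, so it illustrates only the restart-and-jump scheme. The function names, the jump width, the improvement margin and the toy data (an exponential-loss process with rate 0.3) are all assumptions.

```python
import random


def simulate_loss(rate, x0, steps):
    """Discrete exponential loss: x[t+1] = x[t] - rate * x[t]."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - rate * xs[-1])
    return xs


def sse(rate, observed):
    predicted = simulate_loss(rate, observed[0], len(observed) - 1)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def local_search(rate, observed, low, high, step=0.1, tol=1e-6):
    """Stand-in for Levenberg-Marquardt: shrinking-step descent to a local optimum."""
    best = sse(rate, observed)
    while step > tol:
        improved = False
        for candidate in (rate - step, rate + step):
            if low <= candidate <= high:
                err = sse(candidate, observed)
                if err < best:
                    rate, best, improved = candidate, err, True
        if not improved:
            step /= 2
    return rate, best


def estimate(observed, low=0.0, high=1.0, restarts=5, max_jumps=10, seed=0):
    rng = random.Random(seed)
    best_rate, best_err = None, float("inf")
    for _ in range(restarts):
        # Stages 1 and 2: random start inside the range, then local improvement.
        rate, err = local_search(rng.uniform(low, high), observed, low, high)
        failures = 0
        while failures < max_jumps:
            # Stage 3: random jump from the current optimum, then search again.
            jumped = min(high, max(low, rate + rng.gauss(0.0, 0.2)))
            new_rate, new_err = local_search(jumped, observed, low, high)
            if new_err < err - 1e-12:   # require a real improvement
                rate, err, failures = new_rate, new_err, 0
            else:
                failures += 1
        # Stage 4: after max_jumps failures, fall through to a fresh restart.
        if err < best_err:
            best_rate, best_err = rate, err
    return best_rate, best_err


observed = simulate_loss(0.3, 10.0, 20)   # noise-free toy data
rate, err = estimate(observed)
```

On this noise-free toy series the search recovers a rate close to 0.3. Real process models have many interacting parameters, which is where repeated restarts become expensive.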
21
More Issues in Process Model Induction
Inductive process modeling raises a number of
issues that have clear analogues in other
paradigms:
  • identifying conditions on component processes
  • inferring initial values of unobservable
    variables
  • keeping the structural search space tractable
  • reducing variance to mitigate overfitting effects

We have demonstrated promising responses to these
problems within the IPM framework.
22
Evaluation of the IPM Algorithm
To demonstrate IPM's ability to induce process
models, we ran it on synthetic data for a known
system:
1. We used the aquatic ecosystem model to
   generate data sets over 100 time steps for the
   variables nitro and phyto.
2. We replaced each true value x with
   x × (1 + r × n), where r followed a Gaussian
   distribution (μ = 0, σ = 1) and n > 0.
3. We ran IPM on these noisy data, giving it type
   constraints and generic processes as
   background knowledge.
In two experiments, we let IPM determine the
initial values and thresholds given the correct
structure; in a third study, we let it search
through a space of 256 generic model structures.
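The noise-injection step can be illustrated with a few lines of Python. This sketch assumes the corruption rule reads x × (1 + r × n) with r drawn from a standard Gaussian and n the noise level (for example 0.05 for 5% noise); the clean series below is a made-up decay curve, not output of the ecosystem model.

```python
import random


def add_noise(values, noise_level, seed=0):
    """Replace each x with x * (1 + r * n), where r ~ N(0, 1) and n = noise_level."""
    rng = random.Random(seed)
    return [x * (1 + rng.gauss(0.0, 1.0) * noise_level) for x in values]


# Hypothetical clean trajectory over 100 time steps.
clean = [10.0 * 0.9 ** t for t in range(100)]
noisy_5 = add_noise(clean, 0.05)
noisy_10 = add_noise(clean, 0.10)
```

Because the noise is multiplicative, its absolute size shrinks as the underlying value shrinks.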
23
Experimental Results with IPM
The main results of our studies with IPM on
synthetic data were:
1. The system infers accurate estimates for the
   initial values of unobservable variables like
   zoo and residue.
2. The system induces estimates of condition
   thresholds on nitro that are close to the
   target values.
3. The MDL criterion selects the correct model
   structure in all runs with 5% noise, but only
   40% of runs with 10% noise.
These suggest that the basic approach is sound,
but that we should consider more MDL schemes and
other responses to overfitting.
24
Observations from the Ross Sea
25
Results on Training Data from Ross Sea
26
Results on Test Data from Ross Sea
27
Collecting Data on Photosynthetic Processes
[Figure: experimental setup. A continuous culture
(chemostat) receives external stimuli (e.g.,
light); the health of the culture is plotted
against time through an adaptation period and an
equilibrium period, followed by sampling of
mRNA/cDNA and a microarray trace. Image sources:
www.affymetrix.com/ and
/wwwscience.murdoch.edu.au/teach]
28
Gene Expressions for Cyanobacteria
29
Generic Processes for Photosynthesis Regulation
generic process translation
  variables: P{protein}, M{mRNA}
  parameters: α [0, 1]
  equations: d[P,t,1] = α × M

generic process transcription
  variables: M{mRNA}, R{rate}
  parameters:
  equations: d[M,t,1] = R

generic process regulate_one
  variables: R{rate}, S{signal}
  parameters: α [-1, 1]
  equations: R = α × S

generic process regulate_two
  variables: R{rate}, S{signal}
  parameters: α [-1, 1], β [0, 1]
  equations: R = α × S
             d[S,t,1] = -1 × β × S

generic process automatic_degradation
  variables: C{concentration}
  conditions: C > 0
  parameters: α [0, 1]
  equations: d[C,t,1] = -1 × α × C

generic process controlled_degradation
  variables: D{concentration}, E{concentration}
  conditions: D > 0, E > 0
  parameters: α [0, 1]
  equations: d[D,t,1] = -1 × α × E
             d[E,t,1] = -1 × α × E

generic process photosynthesis
  variables: L{light}, P{protein}, R{redox}, S{ROS}
  parameters: α [0, 1], β [0, 1]
  equations: d[R,t,1] = α × L × P
             d[S,t,1] = β × L × P
30
A Process Model for Photosynthetic Regulation
model photo_regulation
  variables: light, mRNA, protein, ROS, redox,
             transcription_rate
  observables: light, mRNA

  process photosynthesis
    equations: d[redox,t,1] = 0.0155 × light × protein
               d[ROS,t,1] = 0.019 × light × protein
  process protein_translation
    equations: d[protein,t,1] = 7.54 × mRNA
  process mRNA_transcription
    equations: d[mRNA,t,1] = transcription_rate
  process regulate_one_1
    equations: transcription_rate = 0.99 × light
  process regulate_two_2
    equations: transcription_rate = 1.203 × redox
               d[redox,t,1] = -0.0002 × redox
  process automatic_degradation_1
    conditions: protein > 0
    equations: d[protein,t,1] = -1.91 × protein
  process controlled_degradation_1
    conditions: redox > 0, ROS > 0
    equations: d[redox,t,1] = -0.0003 × ROS
               d[ROS,t,1] = -0.0003 × ROS
31
Predictions from Best Parameterized Model
32
Electric Power on the International Space Station
33
Results on Battery Test Data
34
Results on Data from Ringkøbing Fjord
35
Issue 5: Interfacing with Scientists
Because few scientists want to be replaced, we
are developing an interactive environment that
lets users
  • specify a quantitative process model of the
    target system
  • display and edit the model's structure and
    details graphically
  • simulate the model's behavior over time and
    situations
  • compare the model's predicted behavior to
    observations
  • invoke a revision module in response to detected
    anomalies.

The environment offers computational assistance
in forming and evaluating models but lets the
user retain control.
36
Viewing and Editing a Process Model
37
Results of Revising the NPP Model
Initial model:
  E = 0.56 × T1 × T2 × W
  T2 = 1.18 / [(1 + e^(0.2 × (Topt - Tempc - 10)))
               × (1 + e^(0.3 × (Tempc - Topt - 10)))]
  PET = 1.6 × (10 × Tempc / AHI)^A × PET-TW-M
  SR ∈ {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05,
        4.05, 4.05, 5.09, 4.05}
  RMSE on training data = 465.212 and r² = 0.799

Revised model:
  E = 0.353 × T1^0.00 × T2^0.08 × W^0.00
  T2 = 0.83 / [(1 + e^(1.0 × (Topt - Tempc - 6.34)))
               × (1 + e^(1.0 × (Tempc - Topt - 11.52)))]
  PET = 1.6 × (10 × Tempc / AHI)^A × PET-TW-M
  SR ∈ {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04,
        0.43, 1.35, 1.85, 1.61}
  Cross-validated RMSE = 397.306 and r² = 0.853
  (15% reduction)
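To make the effect of the revision concrete, the temperature factor T2 can be evaluated before and after revision. The sketch below assumes that T2 has the form scale / [(1 + e^(k1 × (Topt - Tempc - off1))) × (1 + e^(k2 × (Tempc - Topt - off2)))], with the constants listed on the slide; the function names and the sample temperatures are hypothetical. It also recomputes the relative RMSE reduction from the two reported error values.

```python
import math


def t2(tempc, topt, scale, k1, k2, off1, off2):
    """Temperature factor with two logistic terms (assumed functional form)."""
    return scale / ((1 + math.exp(k1 * (topt - tempc - off1)))
                    * (1 + math.exp(k2 * (tempc - topt - off2))))


def t2_initial(tempc, topt):
    return t2(tempc, topt, 1.18, 0.2, 0.3, 10.0, 10.0)


def t2_revised(tempc, topt):
    return t2(tempc, topt, 0.83, 1.0, 1.0, 6.34, 11.52)


# Hypothetical case: the temperature equals its optimum.
at_optimum_initial = t2_initial(25.0, 25.0)
at_optimum_revised = t2_revised(25.0, 25.0)

# Relative reduction in RMSE implied by the two reported values.
rmse_reduction = 1 - 397.306 / 465.212
```

The reduction works out to just under 0.15, in line with the 15% figure on the slide, although the slide compares training RMSE for the initial model with cross-validated RMSE for the revised one.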



38
Intellectual Influences
Our approach to computational discovery
incorporates ideas from many traditions
  • computational scientific discovery (e.g., Langley
    et al., 1983)
  • theory revision in machine learning (e.g.,
    Towell, 1991)
  • qualitative physics and simulation (e.g., Forbus,
    1984)
  • languages for scientific simulation (e.g.,
    STELLA, MATLAB)
  • interactive tools for data analysis (e.g.,
    Shneiderman, 2001).

Our work combines, in novel ways, insights from
machine learning, AI, programming languages, and
human-computer interaction.
39
Contributions of the Research
In summary, our work on computational scientific
discovery has, in responding to various
challenges, produced
  • a new formalism for representing scientific
    process models
  • a computational method for simulating these
    models' behavior
  • an encoding for background knowledge as generic
    processes
  • an algorithm for inducing process models from
    time-series data
  • an interactive environment for model
    construction/utilization.

We have demonstrated this approach to model
creation on domains from Earth science,
microbiology, and engineering.
40
Directions for Future Research
Despite our progress to date, we need further
work in order to
  • produce additional results on other scientific
    data sets
  • develop improved methods for fitting model
    parameters
  • extend the approach to handle data sets with
    missing values
  • implement heuristic methods for searching the
    structure space
  • utilize knowledge of subsystems to further
    constrain search
  • augment the modeling environment to make it more
    usable

Inductive process modeling has great potential to
speed progress in systems science and
engineering.
41
End of Presentation