Title: Causal Inference and Graphical Models
1 Causal Inference and Graphical Models
- Peter Spirtes
- Carnegie Mellon University
2 Overview
- Manipulations
- Assuming no Hidden Common Causes
- From DAGs to Effects of Manipulation
- From Data to Sets of DAGs
- From Sets of DAGs to Effects of Manipulation
- May be Hidden Common Causes
- From Data to Sets of DAGs
- From Sets of DAGs to Effects of Manipulations
3 If I were to force a group of people to smoke one pack a day, what percentage would develop lung cancer?
The Evidence
4 P(Lung cancer = yes) = 1/2
5 Conditioning on Teeth white = yes
P(Lung Cancer = yes | Teeth white = yes) = 1/4
6 Manipulating Teeth white = yes
7 Manipulating Teeth white = yes - After Waiting
P(Lung Cancer = yes || P(Teeth white = yes)) = 1/2
≠
P(Lung Cancer = yes | Teeth white = yes) = 1/4
8 Smoking Decision
- Setting insurance rates for smokers - conditioning
- Suppose the Surgeon General is considering banning smoking:
- Will this decrease smoking?
- Will decreasing smoking decrease cancer?
- Will it have negative side-effects, e.g. more obesity?
- How is greater life expectancy valued against the decrease in pleasure from smoking?
9 Manipulations and Distributions
- Since Smoking determines Teeth white, P(T,L,R,W) = P(S,L,R,W)
- But the manipulation of Teeth white leads to different results than the manipulation of Smoking
- Hence the distribution does not always uniquely determine the results of a manipulation
10 Causation
- We will infer average causal effects.
- We will not consider quantities such as probability of necessity, probability of sufficiency, or the counterfactual probability that I would get a headache conditional on taking an aspirin, given that I did not take an aspirin.
- The causal relations are between properties of a unit at a time, not between events.
- Each unit is assumed to be causally isolated.
- The causal relations may be genuinely indeterministic, or only apparently indeterministic.
11 Causal DAGs
- Probabilistic Interpretation of DAGs
- A DAG represents a distribution P when each variable is independent of its non-descendants conditional on its parents in the DAG
- Causal Interpretation of DAGs
- There is a directed edge from A to B (relative to V) when A is a direct cause of B.
- An acyclic graph is not a representation of reversible or feedback processes
12 Conditioning
- Conditioning maps a probability distribution and an event into a new probability distribution
- f(P(V), e) → P′(V), where P′(V = v) = P(V = v)/P(e) for values v consistent with e
13 Manipulating
- A manipulation maps a population joint probability distribution, a causal DAG, and a set of new probability distributions for a set of variables, into a new joint distribution
- Manipulating for X1,…,Xn ∈ V:
- f( P(V) [the population distribution], G [the causal DAG], P(X1 | Non-Descendants(G,X1)),…, P(Xn | Non-Descendants(G,Xn)) [the manipulated variables] ) → P′(V) [the manipulated distribution]
- (assumption that manipulations are independent)
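The contrast between conditioning (slide 12) and manipulating can be checked numerically. Below is a minimal Python sketch, assuming a toy parameterization of the Smoking / Teeth white / Lung cancer example that is consistent with the numbers on slides 4-7 (the specific probabilities are illustrative): manipulating Teeth white replaces its factor in the factorization by a point mass, while conditioning renormalizes the unmanipulated joint.

    from itertools import product

    # Toy parameterization: Smoking -> Teeth white (deterministic), Smoking -> Lung cancer.
    p_smoke = {1: 0.5, 0: 0.5}                       # P(Smoking)
    p_teeth = {(1, 1): 0.0, (1, 0): 1.0,             # P(Teeth white | Smoking):
               (0, 1): 1.0, (0, 0): 0.0}             # smokers get non-white teeth
    p_cancer = {(1, 1): 0.75, (1, 0): 0.25,          # P(Lung cancer | Smoking)
                (0, 1): 0.25, (0, 0): 0.75}

    def joint(manipulate_teeth_to=None):
        """Joint over (Smoking, Teeth white, Lung cancer); manipulating Teeth
        white replaces its conditional by a point mass on the chosen value."""
        dist = {}
        for s, t, l in product((0, 1), repeat=3):
            p_t = (p_teeth[(s, t)] if manipulate_teeth_to is None
                   else float(t == manipulate_teeth_to))
            dist[(s, t, l)] = p_smoke[s] * p_t * p_cancer[(s, l)]
        return dist

    def prob(dist, event):
        return sum(p for v, p in dist.items() if event(v))

    pre = joint()
    # Conditioning on Teeth white = yes lowers the probability of lung cancer:
    p_cond = (prob(pre, lambda v: v[1] == 1 and v[2] == 1) /
              prob(pre, lambda v: v[1] == 1))
    # Manipulating Teeth white = yes leaves it unchanged:
    post = joint(manipulate_teeth_to=1)
    p_manip = prob(post, lambda v: v[2] == 1)
    print(p_cond, p_manip)    # 0.25 and 0.5, matching slides 5 and 7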
14 Manipulation Notation - Adapting Lauritzen
- The distribution of Lung Cancer given the manipulated distribution of Smoking:
- P(Lung Cancer || P(Smoking))
- The distribution of Lung Cancer conditional on Radon, given the manipulated distribution of Smoking:
- P(Lung Cancer | Radon || P(Smoking))
- = P(Lung Cancer, Radon || P(Smoking)) / P(Radon || P(Smoking))
- First manipulate, then condition
15 Ideal Manipulations
- No fat hand
- Effectiveness
- Whether or not any actual action is an ideal manipulation of a variable Z is not part of the theory - it is input to the theory.
- With respect to a system of variables containing murder rates, outlawing cocaine is not an ideal manipulation of cocaine usage:
- It is not entirely effective - people still use cocaine
- It affects murder rates directly, not via its effect on cocaine usage, because of increased gang warfare
16 3 Representations of Manipulations
- Structural Equation
- Policy Variable
- Potential Outcomes
17 College Plans
- Sewell and Shah (1968) studied five variables from a sample of 10,318 Wisconsin high school seniors.
- SEX: male = 0, female = 1
- IQ: Intelligence Quotient, lowest = 0, highest = 3
- CP: college plans, yes = 0, no = 1
- PE: parental encouragement, low = 0, high = 1
- SES: socioeconomic status, lowest = 0, highest = 3
18 College Plans - A Hypothesis
[Figure: hypothesized causal DAG over SES, SEX, IQ, PE, and CP.]
19 Equational Representation
- xi = fi(pai(G), ei)
- If the ei are causes of two or more variables, they must be included in the analysis
- There is a distribution over the ei
- The equations and the distribution over the ei determine a distribution over the xi
- When manipulating a variable to a value, replace its equation with xi = c
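A minimal sketch of the equational representation, using made-up linear-Gaussian equations over continuous stand-ins for the college-plans variables (purely illustrative, not the Sewell and Shah model): each variable is a function of its parents and an error term, and a manipulation replaces a variable's equation by the constant assignment xi = c.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    def simulate(manipulate_pe_to=None):
        """Simulate the structural equations; manipulating PE replaces its
        equation by a constant assignment (coefficients are made up)."""
        ses = rng.normal(size=n)                       # exogenous
        sex = rng.binomial(1, 0.5, size=n)             # exogenous
        iq = 0.7 * ses + rng.normal(size=n)
        if manipulate_pe_to is None:
            pe = 0.5 * ses + 0.3 * sex + 0.4 * iq + rng.normal(size=n)
        else:
            pe = np.full(n, float(manipulate_pe_to))   # PE set to a constant
        cp = 0.6 * pe + 0.2 * ses + 0.3 * iq + rng.normal(size=n)
        return cp

    # Mean of CP before and after manipulating PE to 1:
    print(simulate().mean(), simulate(manipulate_pe_to=1.0).mean())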
20 Policy Variable Representation
- Pre-manipulation: P(PE,SES,SEX,IQ,CP | policy = off) = P(PE,SES,SEX,IQ,CP)
- Post-manipulation, where the manipulation sets P(PE = 1) = 1:
- P(PE = 1 | policy = on) = 1
- P(SES,SEX,IQ,CP,PE = 1 | policy = on) = P(SES,SEX,IQ,CP,PE = 1 || P(PE))
- P(CP | PE, policy = on) = P(CP | PE || P(PE))
21 From DAG to Effects of Manipulation
[Diagram relating Background Knowledge, Causal Axioms and Prior, Causal DAGs, the Population Distribution, Sampling and Distributional Assumptions and Prior, the Sample, and the Effect of Manipulation.]
22 Causal Sufficiency
- A set of variables is causally sufficient if every cause of two variables in the set is also in the set.
- {PE, CP, SES} is causally sufficient
- {IQ, CP, SES} is not causally sufficient.
23 Causal Markov Assumption
- For a causally sufficient set of variables, the joint distribution is the product of each variable conditional on its parents in the causal DAG.
- P(SES,SEX,PE,CP,IQ) = P(SES)P(SEX)P(IQ | SES)P(PE | SES,SEX,IQ)P(CP | PE)
24 Equivalent Forms of Causal Markov Assumption
- In the population distribution, each variable is independent of its non-descendants in the causal DAG (non-effects) conditional on its parents (immediate causes).
- If X is d-separated from Y conditional on Z (written as <X,Y | Z>) in the causal graph, then X is independent of Y conditional on Z in the population distribution (denoted I(X,Y | Z)).
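D-separation can be checked algorithmically. The sketch below (an illustration, not the tutorial's own code) uses the standard criterion that X and Y are d-separated given Z exactly when Z separates them in the moralized graph of the ancestors of X, Y, and Z; the DAG is given as a map from each variable to its parents, and the example uses a version of the hypothesized college-plans DAG.

    from collections import deque

    def ancestors(parents, nodes):
        """The nodes in `nodes` together with all of their ancestors."""
        seen, stack = set(nodes), list(nodes)
        while stack:
            for p in parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    def d_separated(parents, xs, ys, zs):
        """True iff X is d-separated from Y given Z in the DAG `parents`,
        checked via separation in the moralized ancestral graph."""
        keep = ancestors(parents, set(xs) | set(ys) | set(zs))
        adj = {v: set() for v in keep}
        for child in keep:
            pas = [p for p in parents.get(child, ()) if p in keep]
            for p in pas:                      # undirected parent-child edges
                adj[p].add(child)
                adj[child].add(p)
            for i, p in enumerate(pas):        # 'marry' co-parents of a child
                for q in pas[i + 1:]:
                    adj[p].add(q)
                    adj[q].add(p)
        # Delete Z and test whether X can still reach Y.
        frontier = deque(x for x in xs if x not in zs)
        reached = set(frontier)
        while frontier:
            v = frontier.popleft()
            if v in ys:
                return False
            for w in adj[v]:
                if w not in zs and w not in reached:
                    reached.add(w)
                    frontier.append(w)
        return True

    # A version of the hypothesized college-plans DAG (parents of each variable):
    dag = {'SES': [], 'SEX': [], 'IQ': ['SES'],
           'PE': ['SES', 'SEX', 'IQ'], 'CP': ['PE', 'SES', 'IQ']}
    print(d_separated(dag, {'SEX'}, {'SES'}, set()))    # True: I(SEX, SES)
    print(d_separated(dag, {'SEX'}, {'SES'}, {'PE'}))   # False: PE is a collider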
25 Causal Markov Assumption
- Causal Markov implies that if X is d-separated from Y conditional on Z in the causal DAG, then X is independent of Y conditional on Z.
- Causal Markov is equivalent to assuming that the causal DAG represents the population distribution.
- What would a failure of Causal Markov look like? X and Y are dependent, but X does not cause Y, Y does not cause X, and no variable Z causes both X and Y.
26 Causal Markov Assumption
- Assumes that no unit in the population affects other units in the population
- If the natural units do affect each other, the units should be re-defined to be aggregations of units that don't affect each other
- For example, individual people might be aggregated into families
- Assumes variables are not logically related, e.g. x and x²
- Assumes no feedback
27 Manipulation Theorem - No Hidden Variables
- P(PE,SES,SEX,CP,IQ || P(PE))
- = P(SES)P(SEX)P(CP | PE,SES,IQ)P(IQ | SES)P(PE | policy = on)
- = P(SES)P(SEX)P(CP | PE,SES,IQ)P(IQ | SES)P(PE)
[Figure: the causal DAG over SES, SEX, PE, CP, IQ with a policy variable added as a parent of PE.]
28 Invariance
- Note that P(CP | PE,SES,IQ, policy = on) = P(CP | PE,SES,IQ, policy = off), because the policy variable is d-separated from CP conditional on PE,SES,IQ
- We say that P(CP | PE,SES,IQ) is invariant
- An invariant quantity can be estimated from the pre-manipulation distribution
- This is equivalent to one of the rules of the Do Calculus and can also be applied to latent variable models
29 Calculating Effects
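A minimal sketch of such a calculation, assuming binary stand-ins for all variables and randomly generated conditional probability tables rather than the Sewell and Shah estimates: the manipulated joint is the truncated factorization of slide 27, and the invariance of P(CP | PE,SES,IQ) from slide 28 can be confirmed numerically.

    import random
    from itertools import product

    random.seed(0)

    # Hypothesized DAG; all variables made binary purely for illustration.
    parents = {'SES': [], 'SEX': [], 'IQ': ['SES'],
               'PE': ['SES', 'SEX', 'IQ'], 'CP': ['PE', 'SES', 'IQ']}
    order = ['SES', 'SEX', 'IQ', 'PE', 'CP']

    # Hypothetical conditional probability tables: cpt[v][pa_values] = P(v = 1 | pa).
    cpt = {v: {pa: random.uniform(0.1, 0.9)
               for pa in product((0, 1), repeat=len(parents[v]))}
           for v in order}

    def joint(manipulated_pe=None):
        """Truncated factorization: replace P(PE | SES,SEX,IQ) by the
        manipulated distribution P(PE) when manipulated_pe is given."""
        dist = {}
        for values in product((0, 1), repeat=len(order)):
            v = dict(zip(order, values))
            p = 1.0
            for name in order:
                if name == 'PE' and manipulated_pe is not None:
                    p *= manipulated_pe[v['PE']]
                else:
                    p1 = cpt[name][tuple(v[pa] for pa in parents[name])]
                    p *= p1 if v[name] == 1 else 1 - p1
            dist[values] = p
        return dist

    def cond(dist, target_idx, given):
        """P(target = 1 | given), with `given` a map from index to value."""
        num = sum(p for vals, p in dist.items()
                  if vals[target_idx] == 1 and all(vals[i] == x for i, x in given.items()))
        den = sum(p for vals, p in dist.items()
                  if all(vals[i] == x for i, x in given.items()))
        return num / den

    pre = joint()
    post = joint(manipulated_pe={1: 1.0, 0: 0.0})   # force PE = 1

    # Invariance: P(CP | PE,SES,IQ) is the same before and after manipulating PE.
    i = {name: k for k, name in enumerate(order)}
    given = {i['PE']: 1, i['SES']: 0, i['IQ']: 1}
    print(cond(pre, i['CP'], given), cond(post, i['CP'], given))   # equal
    # The manipulated marginal P(CP = 1 || P(PE = 1) = 1):
    print(sum(p for vals, p in post.items() if vals[i['CP']] == 1))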
30 From Sample to Sets of DAGs
[Inference diagram repeated from slide 21.]
31 From Sample to Population to DAGs
- Constraint-Based
- Uses tests of conditional independence
- Goal: Find the set of DAGs whose d-separation relations match most closely the results of conditional independence tests
- Score-Based
- Uses scores such as the Bayesian Information Criterion or the Bayesian posterior
- Goal: Maximize the score
32 Two Kinds of Search

                                                    Constraint   Score
  Uses non-conditional-independence information        No         Yes
  Quantitative comparison of models                    No         Yes
  Single test result can lead astray                   Yes        No
  Easy to apply to latent variable models              Yes        No
33 Bayesian Information Criterion
- BIC(G, D) = log P(D | θ̂_G, G) − (d/2) log N, where
- D is the sample data
- G is a DAG
- θ̂_G is the vector of maximum likelihood estimates of the parameters for DAG G
- N is the sample size
- d is the dimensionality of the model, which in DAGs without latent variables is simply the number of free parameters in the model
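A small sketch assuming the conventional score form written above (the exact convention used in the tutorial is not reproduced here): the score is the maximized log-likelihood penalized by (d/2) log N, and for a discrete DAG without latent variables d is the number of free parameters.

    import math

    def bic_score(loglik, d, n):
        """BIC score of DAG G: maximized log-likelihood minus (d/2) log N."""
        return loglik - 0.5 * d * math.log(n)

    def num_free_parameters(levels, parents):
        """Free parameters of a discrete DAG: for each variable, (levels - 1)
        times the number of joint configurations of its parents."""
        d = 0
        for v, pars in parents.items():
            q = 1
            for p in pars:
                q *= levels[p]
            d += (levels[v] - 1) * q
        return d

    # Example: a version of the hypothesized college-plans DAG with the levels from slide 17.
    levels = {'SES': 4, 'SEX': 2, 'IQ': 4, 'PE': 2, 'CP': 2}
    parents = {'SES': [], 'SEX': [], 'IQ': ['SES'],
               'PE': ['SES', 'SEX', 'IQ'], 'CP': ['PE', 'SES', 'IQ']}
    print(num_free_parameters(levels, parents))   # dimensionality d of the model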
34 3 Kinds of Alternative Causal Models
[Figures: the True Model and Alternatives 1, 2, and 3, each a DAG over SES, SEX, PE, CP, IQ.]
35 Alternative Causal Models
[Figures: the True Model and Alternative 1.]
- Constraint-Based: Alternative 1 violates the Causal Markov Assumption by entailing that SES and IQ are independent
- Score-Based: Use a score that prefers a model that contains the true distribution over one that does not.
36 Alternative Causal Models
[Figures: the True Model and Alternative 2.]
- Constraint-Based: Assume that if SEX and CP are independent (conditional on some subset of variables such as PE, SES, and IQ) then SEX and CP are not adjacent - the Causal Adjacency Faithfulness Assumption.
- Score-Based: Use a score such that if two models contain the true distribution, the one with fewer parameters is chosen. The True Model has fewer parameters.
37 Both Assumptions Can Be False
- In Alternative 2, the independence holds only for parameter values on a lower-dimensional surface (Lebesgue measure 0)
- In the True Model, the independence holds for all values of the parameters
38 When Not to Assume Faithfulness
- Deterministic relationships between variables entail extra conditional independence relations, in addition to those entailed by the global directed Markov condition.
- If A → B → C, and B = A, and C = B, then not only I(A,C | B), which is entailed by the global directed Markov condition, but also I(B,C | A), which is not.
- The deterministic relations are theoretically detectable, and when present, faithfulness should not be assumed.
- Do not assume faithfulness in feedback systems in equilibrium.
39 Alternative Causal Models
[Figures: the True Model and the alternative discussed below.]
- Constraint-Based: Alternative 2 entails the same set of conditional independence relations - there is no principled way to choose.
40 Alternative Causal Models
[Figures: the True Model and Alternative 2.]
- Score-Based: Whether or not one can choose depends upon the parametric family.
- For unrestricted discrete, or linear Gaussian, models there is no way to choose - the BIC scores will be the same.
- For linear non-Gaussian models, the True Model will be preferred (because while the two models entail the same second order moments, they entail different fourth order moments).
41 Patterns
- A pattern (or p-dag) represents a set of DAGs that all have the same d-separation relations, i.e. a d-separation equivalence class of DAGs.
- The adjacencies in a pattern are the same as the adjacencies in each DAG in the d-separation equivalence class.
- An edge is oriented as A → B in the pattern if it is oriented as A → B in every DAG in the equivalence class.
- An edge is left unoriented as A - B in the pattern if it is oriented as A → B in some DAGs in the equivalence class, and as B → A in other DAGs in the equivalence class.
42 Patterns to Graphs
- All of the DAGs in a d-separation equivalence class can be derived from the pattern that represents the d-separation equivalence class by orienting the unoriented edges in the pattern.
- Every orientation of the unoriented edges is acceptable as long as it creates no cycle and no new unshielded colliders.
- That is, A - B - C can be oriented as A → B → C, A ← B ← C, or A ← B → C, but not as A → B ← C.
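A minimal sketch of this extension step (an illustration, not the tutorial's own code): enumerate the orientations of the pattern's unoriented edges and keep those that are acyclic and create no unshielded collider beyond those already in the pattern. On the A - B - C example above it returns exactly the three admissible DAGs.

    from itertools import product

    def acyclic(edges, nodes):
        """Cycle check: repeatedly remove nodes with no remaining parents."""
        parents = {v: {a for a, b in edges if b == v} for v in nodes}
        remaining = set(nodes)
        while remaining:
            free = [v for v in remaining if not (parents[v] & remaining)]
            if not free:
                return False
            remaining -= set(free)
        return True

    def colliders(edges, adjacencies):
        """Unshielded colliders a -> b <- c with a and c non-adjacent."""
        return {(min(a, c), b, max(a, c))
                for a, b in edges for c, d in edges
                if d == b and c != a and frozenset((a, c)) not in adjacencies}

    def dags_in_pattern(nodes, directed, undirected):
        """DAGs represented by a pattern: orient the unoriented edges in every
        way that is acyclic and adds no new unshielded collider."""
        adjacencies = ({frozenset(e) for e in directed} |
                       {frozenset(e) for e in undirected})
        base = colliders(set(directed), adjacencies)
        out = []
        for bits in product((0, 1), repeat=len(undirected)):
            extra = {(a, b) if bit == 0 else (b, a)
                     for (a, b), bit in zip(undirected, bits)}
            edges = set(directed) | extra
            if acyclic(edges, nodes) and colliders(edges, adjacencies) == base:
                out.append(edges)
        return out

    dags = dags_in_pattern(['A', 'B', 'C'], directed=[],
                           undirected=[('A', 'B'), ('B', 'C')])
    print(len(dags))   # 3: A->B->C, A<-B<-C, A<-B->C; A->B<-C is excluded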
43 Patterns
[Figure: the d-separation equivalence class of DAGs over SES, SEX, PE, CP, IQ and the pattern that represents it.]
44 Search Methods
- Constraint-Based
- PC (correct in the limit)
- Variants of PC (correct in the limit, better at small sample sizes)
- Score-Based
- Greedy hill climbing
- Simulated annealing
- Genetic algorithms
- Greedy Equivalence Search (correct in the limit)
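A minimal sketch of the adjacency (skeleton) phase of the PC algorithm, assuming a conditional-independence oracle indep(x, y, s); in practice indep would be a statistical test applied to the sample, and the collider-orientation phase and Meek rules are omitted here.

    from itertools import combinations

    def pc_skeleton(nodes, indep):
        """Adjacency (skeleton) phase of PC.  `indep(x, y, s)` is a
        conditional-independence oracle (in practice, a statistical test)."""
        adj = {v: set(nodes) - {v} for v in nodes}
        sepset = {}                      # recorded for the later orientation phase
        k = 0
        while any(len(adj[x] - {y}) >= k for x in nodes for y in adj[x]):
            for x in nodes:
                for y in list(adj[x]):
                    if y not in adj[x]:
                        continue
                    # Try to separate x and y by a size-k subset of x's other neighbors.
                    for s in combinations(sorted(adj[x] - {y}), k):
                        if indep(x, y, set(s)):
                            adj[x].discard(y)
                            adj[y].discard(x)
                            sepset[frozenset((x, y))] = set(s)
                            break
            k += 1
        return adj, sepset

    # Toy oracle for the chain A -> B -> C: the only conditional independence
    # among the observed variables is A independent of C given B.
    def oracle(x, y, s):
        return {x, y} == {'A', 'C'} and 'B' in s

    skeleton, seps = pc_skeleton(['A', 'B', 'C'], oracle)
    print(skeleton)   # A - B and B - C remain; the A - C edge is removed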
45 From Sets of DAGs to Effects of Manipulation
[Inference diagram repeated from slide 21.]
46 Causal Inference in Patterns
- Is P(IQ) invariant when SES is manipulated to a constant? Can't tell.
- If SES → IQ, then the policy variable is d-connected to IQ given the empty set - no invariance.
- If SES ← IQ, then the policy variable is not d-connected to IQ given the empty set - invariance.
[Figure: the pattern over SES, SEX, PE, CP, IQ with a policy variable attached to SES.]
47 Causal Inference in Patterns
- Different DAGs represented by the pattern give different answers as to the effect of manipulating SES on IQ - not identifiable.
- In these cases, the output should be 'can't tell'.
- Note the difference from using Bayesian networks for classification - we can use either DAG equally well for correct classification, but we have to know which one is true for correct inference about the effect of a manipulation.
[Figure: the pattern over SES, SEX, PE, CP, IQ with a policy variable attached to SES.]
48 Causal Inference in Patterns
- Is P(CP | PE,SES,IQ) invariant when PE is manipulated to a constant? Can tell.
- The policy variable is d-separated from CP given PE, SES, IQ regardless of which way the undecided edge points - invariance in every DAG represented by the pattern.
[Figure: the pattern over SES, SEX, PE, CP, IQ with a policy variable attached to PE.]
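A sketch of the reasoning on slides 46-48, reusing the d_separated helper from the sketch on slide 24 (the specific parent sets are the illustrative ones used there): add a policy variable as an extra parent of the manipulated variable, and check the relevant d-separation in each DAG represented by the pattern, i.e. under each orientation of the undecided SES - IQ edge.

    # Reuses d_separated() from the d-separation sketch on slide 24.
    base = {'SEX': [], 'PE': ['SES', 'SEX', 'IQ'], 'CP': ['PE', 'SES', 'IQ']}
    dag_ses_to_iq = dict(base, SES=[], IQ=['SES'])            # SES -> IQ
    dag_iq_to_ses = dict(base, SES=['IQ'], IQ=[])             # SES <- IQ

    def invariant(dag, manipulated, target, given):
        """Is P(target | given) invariant under manipulating `manipulated`?
        True iff the policy variable is d-separated from the target given `given`."""
        with_policy = dict(dag)
        with_policy[manipulated] = list(dag[manipulated]) + ['policy']
        with_policy['policy'] = []
        return d_separated(with_policy, {'policy'}, {target}, set(given))

    # Manipulating SES: the two DAGs disagree about P(IQ), so 'can't tell'.
    print(invariant(dag_ses_to_iq, 'SES', 'IQ', []),
          invariant(dag_iq_to_ses, 'SES', 'IQ', []))          # False, True
    # Manipulating PE: P(CP | PE,SES,IQ) is invariant in both DAGs.
    print(invariant(dag_ses_to_iq, 'PE', 'CP', ['PE', 'SES', 'IQ']),
          invariant(dag_iq_to_ses, 'PE', 'CP', ['PE', 'SES', 'IQ']))   # True, True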
49 College Plans
[Figure: the college plans model with one quantity labeled 'not invariant, but identifiable' and another labeled 'invariant'.]
50 Good News
In the large sample limit, there are algorithms (PC, Greedy Equivalence Search) that are arbitrarily close to correct (or output 'can't tell') with probability 1 (pointwise consistency).
51 Bad News
At every finite sample size, every method will be far from the truth with high probability for some values of the truth (no uniform consistency). (Typically not true of classification problems.)
52 Why Bad News?
The problem: small differences in the population distribution can lead to big changes in the inference to causal DAGs.
53 Strengthening Faithfulness Assumption
- Strong versus weak:
- Weak adjacency faithfulness assumes that a zero conditional dependence between X and Y entails a zero-strength edge between X and Y
- Strong adjacency faithfulness assumes in addition that a weak conditional dependence between X and Y entails a weak-strength edge between X and Y
- Under this assumption, there are uniformly consistent estimators of the effects of manipulations.
54 Obstacles to Causal Inference from Non-experimental Data
- unmeasured confounders
- measurement error, or discretization of data
- mixtures of different causal structures in the sample
- feedback
- reversibility
- the existence of a number of models that fit the data equally well
- an enormous search space
- low power of tests of independence conditional on large sets of variables
- selection bias
- missing values
- sampling error
- complicated and dense causal relations among sets of variables
- complicated probability distributions
55 From Data to Sets of DAGs - Possible Hidden Variables
[Inference diagram repeated from slide 21.]
56 Why Latent Variable Models?
- For classification problems, introducing latent variables can help get closer to the right answer at smaller sample sizes - but they are not needed to get the right answer in the limit.
- For causal inference problems, introducing latent variables is needed to get the right answer in the limit.
57 Score-Based Search Over Latent Models
- Structural EM interleaves estimation of parameters with structural search
- One can also search over latent variable models by calculating posteriors
- But there are substantial computational and statistical problems with latent variable models
58 DAG Models with Latent Variables
- Facilitate the construction of causal models
- But, unlike DAG models without latent variables, they do not in general:
- Provide a finite search space
- Have nice statistical properties:
- Always identified
- Correspond to a set of distributions characterized by independence relations
- Have a well-defined dimension
- Asymptotic existence of ML estimates
59 Solution
- Embed each latent variable model in a larger model without latent variables that is easier to characterize.
- Disadvantage: uses only the conditional independence information in the distribution.
[Figure: the set of distributions of the latent variable model nested inside the set of distributions of the model imposing only independence constraints on the observed variables.]
60 Alternative Hypothesis and Some D-separations
[Figure: a DAG over SES, SEX, PE, CP, IQ with latent variables L1 and L2.]
<L2,SES,L1,SEX,PE | ∅>, <SEX,L1,SES,L2,IQ | ∅>, <L1,SES,L2,SEX | ∅>, <SEX,CP | PE,SES>. These entail conditional independence relations in the population.
<CP,IQ,L1,SEX | L2,PE,SES>, <PE,IQ,L2 | L1,SEX,SES>, <IQ,SEX,PE,CP | L1,L2,SES>, <SES,SEX,IQ,L1,L2 | ∅>
61 D-separations Among Observed
[Figure: the same DAG over SES, SEX, PE, CP, IQ with latent variables L1 and L2.]
<L2,SES,L1,SEX,PE | ∅>, <SEX,L1,SES,L2,IQ | ∅>, <L1,SES,L2,SEX | ∅>, <SEX,CP | PE,SES>
<CP,IQ,L1,SEX | L2,PE,SES>, <PE,IQ,L2 | L1,SEX,SES>, <IQ,SEX,PE,CP | L1,L2,SES>, <SES,SEX,IQ,L1,L2 | ∅>
62 D-separations Among Observed
[Figure: the same latent variable DAG.]
It can be shown that no DAG with just the measured variables has exactly this set of d-separation relations among the observed variables. In this sense, DAGs are not closed under marginalization.
63 Mixed Ancestral Graphs
- Under a natural extension of the concept of d-separation to graphs with ↔ edges, MAG(G) is a graphical object that contains only the observed variables, and has exactly the d-separations among the observed variables.
[Figures: the latent variable DAG and the corresponding MAG over SES, SEX, PE, CP, IQ.]
64 Mixed Ancestral Graph Construction
- There is an edge between A and B if and only if for every <A,B | C>, there is a latent variable in C.
- If A and B are adjacent, then A → B if and only if A is an ancestor of B.
- If A and B are adjacent, then A ↔ B if and only if A is not an ancestor of B and B is not an ancestor of A.
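A sketch of this construction, reusing the ancestors and d_separated helpers from the d-separation sketch on slide 24; "no observed subset d-separates A and B" implements the condition that every d-separating set contains a latent variable. The example latent DAG is hypothetical.

    from itertools import chain, combinations

    # Reuses ancestors() and d_separated() from the d-separation sketch on slide 24.

    def mag_from_dag(parents, observed):
        """Construct the MAG over `observed` from a DAG given as a parent map
        (the latent variables are the nodes not in `observed`)."""
        edges = {}
        obs = sorted(observed)
        for i, a in enumerate(obs):
            for b in obs[i + 1:]:
                others = [v for v in obs if v not in (a, b)]
                subsets = chain.from_iterable(
                    combinations(others, k) for k in range(len(others) + 1))
                # Adjacent in the MAG iff no subset of the observed variables
                # d-separates A and B in the DAG.
                if any(d_separated(parents, {a}, {b}, set(c)) for c in subsets):
                    continue
                a_anc_b = a in ancestors(parents, {b})   # is A an ancestor of B?
                b_anc_a = b in ancestors(parents, {a})
                if a_anc_b:
                    edges[(a, b)] = '->'
                elif b_anc_a:
                    edges[(a, b)] = '<-'
                else:
                    edges[(a, b)] = '<->'
        return edges

    # Hypothetical example: A -> B, with a latent L causing both B and C.
    latent_dag = {'A': [], 'L': [], 'B': ['A', 'L'], 'C': ['L']}
    print(mag_from_dag(latent_dag, observed={'A', 'B', 'C'}))
    # {('A', 'B'): '->', ('B', 'C'): '<->'}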
65 Suppose SES Unmeasured
[Figures: the DAG with SES unmeasured, the corresponding MAG over SEX, PE, CP, IQ, and another DAG (with latent variables L1 and L2) that has the same MAG.]
66 Mixed Ancestral Models
- Can be scored and evaluated in the usual ways
- Not every parameter is directly interpreted as a structural (causal) coefficient
- Not every part of the marginal manipulated model can be predicted from the mixed ancestral graph:
- Because multiple DAGs can have the same MAG, they might not all agree on the effect of a manipulation.
- It is possible to tell from the MAG when all of the DAGs with that MAG agree on the effect of a manipulation.
67 Mixed Ancestral Graph
- Mixed ancestral models are closed under marginalization.
- In the linear normal case, the parameterization of a MAG is just a special case of the parameterization of a linear structural equation model.
- There is a maximum likelihood estimator of the parameters (Drton).
- The BIC score is easy to calculate.
- In the discrete case, it is not known how to parameterize a MAG, though some progress has been made.
68 Some Markov Equivalent Mixed Ancestral Graphs
[Figures: several MAGs over SEX, PE, CP, IQ.]
These different MAGs all have the same d-separation relations.
69 Partial Ancestral Graphs
[Figures: several d-separation equivalent MAGs over SEX, PE, CP, IQ, and the partial ancestral graph (with circle marks) that represents them.]
70 Partial Ancestral Graph represents MAG M
- A is adjacent to B iff A and B are adjacent in M.
- A → B iff A is an ancestor of B in every MAG d-separation equivalent to M.
- A ↔ B iff A and B are not ancestors of each other in every MAG d-separation equivalent to M.
- A o→ B iff B is not an ancestor of A in every MAG d-separation equivalent to M, and A is an ancestor of B in some MAGs d-separation equivalent to M, but not in others.
- A o-o B iff A is an ancestor of B in some MAGs d-separation equivalent to M, but not in others, and B is an ancestor of A in some MAGs d-separation equivalent to M, but not in others.
71 Partial Ancestral Graph
- A Partial Ancestral Graph:
- represents the ancestor features common to the MAGs that are d-separation equivalent
- represents the d-separation relations in the d-separation equivalence class of MAGs
- can be parameterized by turning it into a mixed ancestral graph
- can be scored and evaluated like a MAG
72 FCI Algorithm
- In the large sample limit, with probability 1, the output is a PAG that represents the true graph over O
- If the algorithm needs to test high order conditional independence relations, then it is:
- Time consuming - worst-case number of conditional independence tests (complete PAG)
- Unreliable (low power of the tests)
- Modified versions can halt at any given order of conditional independence test, at the cost of more 'can't tell' answers.
- Not useful information when each pair of variables has a hidden common cause.
- There is a provably correct score-based search, but it outputs 'can't tell' in most cases
73 Output for College Plans
[Figures: the PAG output by the FCI algorithm, and the PAG corresponding to the output of the PC algorithm, over SES, SEX, PE, CP, IQ.]
These are different because no DAG can represent the d-separations in the output of the FCI algorithm.
74 From Sets of DAGs to Effects of Manipulations - May Be Hidden Common Causes
[Inference diagram repeated from slide 21.]
75 Manipulation Model for PAGs
- A PAG can be used to calculate the results of manipulations for which every DAG represented by the PAG gives the same answer.
- It is possible to tell from the PAG that the policy variable for PE is d-separated from CP given PE. Hence P(CP | PE) is invariant.
[Figure: the PAG over SES, SEX, PE, CP, IQ.]
76 Comparison with non-latent case
- FCI:
- P(CP | PE || P(PE)) = P(CP | PE)
- P(CP = 0 | PE = 0 || P(PE)) = .063
- P(CP = 1 | PE = 0 || P(PE)) = .937
- P(CP = 0 | PE = 1 || P(PE)) = .572
- P(CP = 1 | PE = 1 || P(PE)) = .428
- PC:
- P(CP = 0 | PE = 0 || P(PE)) = .095
- P(CP = 1 | PE = 0 || P(PE)) = .905
- P(CP = 0 | PE = 1 || P(PE)) = .484
- P(CP = 1 | PE = 1 || P(PE)) = .516
77 Good News
In the large sample limit, there is an algorithm (FCI) whose output is arbitrarily close to correct (or outputs 'can't tell') with probability 1 (pointwise consistency).
78 Bad News
At every finite sample size, every method will be arbitrarily far from the truth with high probability for some values of the truth (no uniform consistency).
79 Other Constraints
- The disadvantage of using MAGs or FCI is that they use only conditional independence information
- In the case of latent variable models, there are constraints implied on the observed margin that are not conditional independence relations, regardless of the family of distributions
- These can be used to choose between two different latent variable models that have the same d-separation relations over the observed variables
- In addition, there are constraints implied on the observed margin that are particular to a family of distributions
80 Examples of Open Questions
- Complete non-parametric manipulation calculations for partially known DAGs with latent variables
- Define strong faithfulness for the latent case
- Calculating constraints (non-parametric or parametric) from latent variable DAGs
- Using constraints (non-parametric or parametric) to guide search for latent variable DAGs
- Latent variable score-based search over PAGs
- Parameterizations of MAGs for other families of distributions
- Completeness of the do-calculus for PAGs
- Time series inference
81 Introductory Books on Graphical Causal Inference
- Causation, Prediction, and Search, by P. Spirtes, C. Glymour, and R. Scheines, MIT Press, 2000.
- Causality: Models, Reasoning, and Inference, by J. Pearl, Cambridge University Press, 2000.
- Computation, Causation, and Discovery, ed. by C. Glymour and G. Cooper, AAAI Press, 1999.