Title: Causal Inference in Genetics
1Causal Inference in Genetics
Wentian Li, Ph.D., Robert S. Boas Center for
Genomics and Human Genetics, Feinstein Institute
for Medical Research, North Shore LIJ Health
System, NY, USA
2Statistical Correlations
- Rain/sunshine and wet/dry ground
- Wrinkled/smooth face and chance of having cancer
- Genetic marker and disease status (genetic
association and genetic linkage)
- One genetic marker and another genetic marker
- Two biomarkers (indicators of a disease) of the
same disease
3Causal correlations
- Rain causes the wet ground
- Wrinkles do not cause cancer
- A genetic marker may or may not cause the disease
(when it does, it is called functional)
- Two markers being close to each other on a chromosome:
linkage disequilibrium. (The cause/effect
framework does not apply? Either one is a cause
of the other)
- If two biomarkers are involved in a disease, the
one upstream in a pathway is the cause, and the one
downstream is the effect
4Causal inference in Genetics, Epidemiology,
Genomics: Is It Necessary?
- Risk factors of a disease have to be evaluated.
Concept of confounding factor (a variable that
is related to one or more variables in the study)
- No need for causal inference in
genotype-phenotype correlation (only if Lamarck
were right about the inheritance of acquired traits
would there be a reversal)
- I'm addressing the last example: up/downstream
in the biochemical pathway of two biomarkers
5It may not be necessary IF
- Both biomarkers are followed in time
- If one biomarker becomes positive first, it is
the more upstream one
- Unfortunately, temporal information is usually
unavailable, and biomarkers are measured after
the onset of the disease
6Causal inference for observed data is an active
topic outside genetics
- Machine Learning (e.g. causal relationships
between words in a corpus)
- Economics and Social Science
- Biology (Sewall Wright's path analysis)
- Computer science (Bayesian networks)
8A typical graphical model
9More complicated
10Even more complicated
11Outline of the talk
- Introduction to the local causality discovery
- Introduction to the dataset
- Results
14Data Mining and Knowledge Discovery (2000) v4,
pp.163-192
15Problems with Bayesian network modeling
- All relevant variables have to be included
- The emphasis is not on the structure of the network
(sometimes it is even given), but on quantitative
modeling of the transition probabilities
16In reality
- We may not know which variables are relevant
(part of the inference is to find out)
- Not all variables are measured
- We may just be interested in which variable is a
cause and which is an effect, and not in quantitative
conclusions
17Local causality discovery
- Focus on three variables only
- It is OK that other relevant variables are not
included
- The principle of inference is the exclusion of
causal models that are inconsistent with the data
- which may not be successful
18 Local causality discovery (LCD) (cont.)
- Six assumptions: 1. database completeness. 2.
discrete variables. 3. Bayesian network model
(directed acyclic graph, no loops). 4. 5.
no selection bias. 6. valid statistical testing.
- Three variables x, y, z
- Hidden variables are allowed to exist but are not
measured
- Determine six correlations: unconditional C(x,y),
C(y,z), C(x,z), and conditional C(x,z|y),
C(y,z|x), C(x,y|z)
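The six yes/no correlations listed above could be obtained along the following lines. This is only a sketch under stated assumptions, not the procedure used in the talk: all three variables are taken to be binary (0/1), association is judged with a Pearson chi-square test at a hypothetical threshold `alpha`, and conditional association is tested by summing the chi-square statistics over the two strata of the conditioning variable. The function names are illustrative.

```python
import math
from collections import Counter
from itertools import product

def chi2_2x2(pairs):
    """Pearson chi-square statistic for a list of (a, b) pairs of 0/1 values."""
    n = len(pairs)
    counts = Counter(pairs)
    stat = 0.0
    for a, b in product((0, 1), repeat=2):
        row = sum(counts[(a, t)] for t in (0, 1))
        col = sum(counts[(t, b)] for t in (0, 1))
        expected = row * col / n if n else 0.0
        if expected > 0:
            stat += (counts[(a, b)] - expected) ** 2 / expected
    return stat

def dependent(data, i, j, given=None, alpha=0.05):
    """True if columns i and j look associated, optionally given column `given`."""
    if given is None:
        stat = chi2_2x2([(r[i], r[j]) for r in data])
        p = math.erfc(math.sqrt(stat / 2))   # chi-square tail, 1 degree of freedom
    else:
        # Assumption: sum the per-stratum statistics (two strata, so 2 degrees of freedom).
        stat = sum(chi2_2x2([(r[i], r[j]) for r in data if r[given] == v])
                   for v in (0, 1))
        p = math.exp(-stat / 2)              # chi-square tail, 2 degrees of freedom
    return p < alpha

def six_correlations(data):
    """data: list of (x, y, z) tuples of 0/1 values."""
    x, y, z = 0, 1, 2
    return {
        "C(x,y)": dependent(data, x, y),
        "C(y,z)": dependent(data, y, z),
        "C(x,z)": dependent(data, x, z),
        "C(x,z|y)": dependent(data, x, z, given=y),
        "C(y,z|x)": dependent(data, y, z, given=x),
        "C(x,y|z)": dependent(data, x, y, given=z),
    }
```

Each entry of the returned dictionary is one of the six yes/no answers that the later model-exclusion step takes as input.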
19An Example
20Between two variables, there are only 6 causal
relationships (allowing a confounding variable), 4
if x can only be a cause
- no relationship
- confounding
- causing
- confounding + causing
- reverse causing (NO if x can only be a cause)
- confounding plus reverse causing (NO if x can only be a cause)
21Number of causal relationships among three
variables
- 6 x 6 x 6 = 216 possibilities
- 4 x 4 x 6 = 96 if x is not caused by either y or z (but
can receive an arrow from a hidden variable)
(Cooper97 paper)
- 2 x 2 x 6 = 24 if x is not caused by y or z, and
doesn't receive an arrow from hidden confounding
variables (Li and Wang, unpublished)
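The three counts can be checked by brute-force enumeration. The sketch below multiplies out the per-pair options exactly as the slide does; the relation labels and the function names are illustrative choices, not notation from the cited papers.

```python
from itertools import product

# The six possible relationships between an ordered pair (a, b).
PAIR_RELATIONS = [
    "none", "confounding", "causing", "confounding+causing",
    "reverse causing", "confounding+reverse causing",
]

def pair_options(first_is_only_cause, allow_confounding):
    """Relationships allowed for a pair whose first member is x."""
    options = PAIR_RELATIONS
    if first_is_only_cause:
        options = [r for r in options if "reverse" not in r]
    if not allow_confounding:
        options = [r for r in options if "confounding" not in r]
    return options

def count_models(x_is_only_cause, x_can_be_confounded):
    xy = pair_options(x_is_only_cause, x_can_be_confounded)
    xz = pair_options(x_is_only_cause, x_can_be_confounded)
    yz = PAIR_RELATIONS                      # the y-z pair is never restricted
    return sum(1 for _ in product(xy, xz, yz))

print(count_models(False, True))   # all models
print(count_models(True, True))    # x not caused by y or z
print(count_models(True, False))   # x also free of hidden confounders
```

The three printed values are 216, 96 and 24, matching the counts on the slide.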
22Given a causal model
- Unconditional association between any two
variables can be determined by whether they are
connected by a path (one with no collider on it)
- Conditional association can be determined by
the so-called d-separation rule
- Applying all possible causal models to the six
correlations (yes or no) and excluding inconsistent
models leads to
23CCC causal inference rule
- (Cooper version) if C(x,y)+, C(y,z)+, but
C(x,z|y)-,
- then there are only three possible causal
models:
- x -> y -> z
- x <- h -> y -> z
- h -> x -> y -> z
- (Silverstein et al. version) if C(x,y)+, C(y,z)+,
C(x,z)+, but C(x,z|y)-, C(x,y|z)+, C(y,z|x)+,
then...
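The exclusion step behind the Cooper version can be sketched by brute force: enumerate the 4 x 4 x 6 = 96 models in which x is not caused by y or z, predict each association by d-separation, and keep the models that agree with the observed pattern. This is an illustration under assumptions, not the authors' code: association is taken to hold exactly when the variables are d-connected, every confounded pair gets its own hidden node (named `h_xy`, `h_xz`, `h_yz` here), and d-separation is tested on the moralized ancestral graph.

```python
from itertools import combinations, product

RELS_X = ["none", "conf", "cause", "conf+cause"]   # pairs that start with x
RELS_ANY = RELS_X + ["rev", "conf+rev"]            # the y-z pair
PAIRS = (("x", "y"), ("x", "z"), ("y", "z"))

def build_edges(model):
    """Turn three pairwise relations into directed (parent, child) edges."""
    edges = set()
    for (a, b), rel in zip(PAIRS, model):
        if "cause" in rel:
            edges.add((a, b))
        if "rev" in rel:
            edges.add((b, a))
        if "conf" in rel:                 # hidden common cause of a and b
            hidden = "h_" + a + b
            edges |= {(hidden, a), (hidden, b)}
    return edges

def ancestors(edges, nodes):
    found, changed = set(nodes), True
    while changed:
        changed = False
        for parent, child in edges:
            if child in found and parent not in found:
                found.add(parent)
                changed = True
    return found

def d_connected(edges, a, b, given=()):
    """True if a and b are NOT d-separated by the nodes in `given`."""
    keep = ancestors(edges, {a, b, *given})
    sub = [(p, c) for p, c in edges if p in keep and c in keep]
    links = {frozenset(e) for e in sub}
    for (p1, c1), (p2, c2) in combinations(sub, 2):
        if c1 == c2:
            links.add(frozenset((p1, p2)))    # "marry" parents of a shared child
    reached, frontier = {a}, [a]
    while frontier:
        node = frontier.pop()
        for link in links:
            if node in link:
                (other,) = link - {node}
                if other not in reached and other not in given:
                    reached.add(other)
                    frontier.append(other)
    return b in reached

def consistent_models(observed):
    """observed maps (a, b, given) -> True (associated) or False (not)."""
    return [m for m in product(RELS_X, RELS_X, RELS_ANY)
            if all(d_connected(build_edges(m), a, b, g) == dep
                   for (a, b, g), dep in observed.items())]

cooper = {("x", "y", ()): True, ("y", "z", ()): True, ("x", "z", ("y",)): False}
for model in consistent_models(cooper):
    print(dict(zip(("x-y", "x-z", "y-z"), model)))
```

In this encoding three models survive. In all of them x and z have no direct relationship and y causes z without confounding; the x-y relationship is causing, confounding, or both.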
24In words for a three-way correlated set
- If one of the variables (x) is not an effect (only
a cause)
- AND
- If the correlation between x and z is lost
conditional on y,
- THEN
- y causes z
x = gene; y, z = two intermediate phenotypes
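Stated as code, the rule in words is a single conditional. The function below is a minimal sketch with an illustrative name and signature; it takes the yes/no correlations as booleans and returns a conclusion only when every premise holds.

```python
def ccc_rule(c_xy, c_yz, c_xz, c_xz_given_y, x_is_not_effect):
    """Return 'y causes z' if the CCC rule fires, otherwise None.

    c_xy, c_yz, c_xz : pairwise associations (True = associated)
    c_xz_given_y     : association between x and z conditional on y
    x_is_not_effect  : prior knowledge that x can only be a cause (e.g. a gene)
    """
    three_way_correlated = c_xy and c_yz and c_xz
    if x_is_not_effect and three_way_correlated and not c_xz_given_y:
        return "y causes z"
    return None   # the rule is silent; no causal direction can be concluded

# x = gene, y and z = two intermediate phenotypes
print(ccc_rule(True, True, True, c_xz_given_y=False, x_is_not_effect=True))
```

Returning `None` rather than a negative statement reflects that the inference may simply be unsuccessful when the premises are not met.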
25[Diagram labels: x = gene ("x is not an effect", "causal inference is certain"); y = biomarker 1; z = biomarker 2; "?"]
27Rheumatoid Arthritis (RA)
- An autoimmune disease
- Chronic inflammation of joints
- Three times more likely to occur in women than
men
- Age of onset: 40-60
- Twin concordance rates: 12-15% for
MZ (monozygotic) twins, 5% for DZ (dizygotic) twins
- Genetic and environmental (e.g. smoking) risk
factors
28MHC/HLA: the main genetic contribution to RA
- MHC (Major Histocompatibility Complex)
or HLA (Human Leukocyte Antigen):
HLA-DRB1 gene on chromosome 6 (6p21.3)
- The RA-associated alleles are HLA-DRB1*0401,
*0404, *0408 (Caucasian), not *0402, *0403, *0407
- In Asian populations, different DRB1 alleles are
associated with RA (e.g. *0405, *0901)
- A group of DRB1 risk alleles is called the shared
epitope (SE), or rheumatoid epitope; it codes
amino acid positions 70-74 in the third
hypervariable region
29An update on recent whole-genome association studies of RA
- PTPN22 (chr 1) (Begovich et al., AJHG 2004) has
been consistently replicated in Caucasian samples
- STAT4 (chr 2) (Remmers et al., NEJM, to appear)
- (Plenge et al., submitted)
- Wellcome Trust (Nature, June 7, 2007)
30Two auto-antibodies are strongly associated with
RA: RF and anti-CCP
- RF (rheumatoid factor): 80% of RA patients
are RF positive
- anti-CCP (anti-cyclic citrullinated peptide
antibody): an even better predictor
of RA in the early stage
- HLA-DRB1, RF, and anti-CCP are all associated with
the RA disease, and they are associated with each
other. The CCC rule can be applied!
- [reference illegible], 2004, 2052-57
31Q: Between RF and anti-CCP, which one is the
cause and which is the effect?
34 1723 Caucasian RA patients
[Table panels: anti-CCP positive; anti-CCP negative]
35Association between RF and DRB1 genotype is lost
conditional on anti-CCP
36[Diagram labels: x = SE (gene); y = anti-CCP (biomarker 1); z = RF (biomarker 2)]
37Discussions/Issues
- There is evidence that RA patients become
anti-CCP positive before becoming RF positive
- The three-way correlation might be lost in normal
controls (here the data are case-only)
- In between anti-CCP and RF, other factors are
possible (so the cause-effect relationship may not be direct)
- It is not clear where the smoking risk factor
comes in
38Co-Authors
- Mingyi WANG (Zhejiang Univ, Computer Science
Department, causal inference)
- Patricia Irigoyen, Peter Gregersen (North Shore
LIJ, RA data)
39Advertisements
- Bibliography on microarray data analysis:
www.nslij-genetics/microarray/ and
www.cbi.pku.edu.cn/mirror/microarray/microarray.html
- Bibliography on linkage disequilibrium mapping:
www.nslij-genetics.org/ld/
- Bibliography on computational gene recognition:
www.nslij-genetics.org/gene/
- Bibliography on features, patterns, correlations
in DNA sequences: www.nslij-genetics.org/dnacorr/
- Comprehensive list of genetic analysis programs:
www.nslij-genetics.org/soft/ and
linkage.rockefeller.edu/soft/