Title: State Space Models for Survival Analysis
1State Space Models for Survival Analysis
- Weiming Ke
- The University of Memphis
- (Joint work with Dr. Wai-yuan Tan)
- June 29, 2004
2Outline
- Introduction
- State Space Model
- Estimation of Parameters
  - Multi-level Gibbs Sampling Procedure
  - Weighted Bootstrap Procedure
- Estimation of the Survival Probabilities
- An Illustrative Example
- Computer Simulation
- Summary
3Introduction
- Many diseases such as AIDS, cancer and infectious
diseases are often very complicated biologically.
- Most of these diseases are complex stochastic
processes where it is often very difficult to
estimate the unknown parameters, especially in
cases where not many data are available.
4Introduction (cont.)
- In this article we propose a state space modeling
approach that combines stochastic models with
statistical models to describe the process of a
disease.
- We then apply the Gibbs sampling method and
the weighted bootstrap method to estimate the
unknown parameters and the state variables of the
model.
- Using these estimates, we can validate the
model and then estimate the survival
probabilities.
5State Space Model
- The state space model of a system consists of
two sub-models:
- 1. The stochastic system model, which is the
stochastic model of the system.
- 2. The observation model, which is a
statistical model relating some available data to
the system.
- It extracts biological information from the
system via its stochastic system model and
integrates this information with that from the
data through its observation model.
6State Space Model (cont.)
- The state space model was originally proposed by
Kalman in the early 60s for engineering control
and communication (Kalman, 1960). Since then it
has been successfully used as a powerful tool in
aerospace research.
- It was first proposed by Tan and his associates
for AIDS and cancer research (Tan et al., 1998,
1999, 2000, 2001, 2002).
- State space models can evidently be extended to
other diseases as well, such as infectious
diseases.
7Advantages of the State Space Models
- The state space model of a system is advantageous
over the stochastic model of the system alone or
the statistical model of the system alone, since
it combines information and advantages from both
of these models.
- The following are some specific advantages of
state space models:
8Advantages of the State Space Models (cont.)
- The state space model provides an optimal
procedure for updating the model with new data
that may become available in the future. This is
the smoothing step of the state space models.
- The state space model provides an optimal
procedure, via the Gibbs sampling method and the
weighted bootstrap method, to estimate
simultaneously the unknown parameters and the
state variables of interest.
9The stochastic system model
- In many cases, we can derive stochastic equations
for the state variables of a system by using
the basic biological mechanisms of the disease.
- We will illustrate the state space modeling
approach by using a birth-death-illness-cure
process for a disease such as tuberculosis.
10The stochastic system model (cont.)
- Consider a population of individuals who are at
risk for a disease, such as tuberculosis.
- In this population, there are two types of
people: the normal healthy people who do not have
the disease and the sick people who have
contracted the disease.
- Let N1(t) and N2(t) denote the numbers of the
normal people and the sick people, respectively,
in the population at time t.
11The stochastic system model (cont.)
- To derive stochastic differential equations for
the state variables N1(t) and N2(t), the
following transition variables are used:
- F1(t) = Number of normal healthy people who
become sick during [t, t + Δt),
- F2(t) = Number of sick people who are cured by
the drug during [t, t + Δt),
- B1(t), B2(t) = Numbers of births of N1, N2
people during [t, t + Δt),
- D1(t), D2(t) = Numbers of deaths of N1, N2
people during [t, t + Δt),
- R1(t), R2(t) = Numbers of immigrants of N1 and
N2 people during [t, t + Δt).
12The stochastic system model (cont.)
- Let a1 denote the disease rate, and a2 the cure
rate. That is, a1 and a2 are the transition
rates of N1 → N2 and N2 → N1, respectively.
- Let b1, d1, λ1 denote the birth rate, death
rate and immigration rate of the N1 people.
- Let b2, d2, λ2 denote the birth rate, death
rate and immigration rate of the N2 people.
13The stochastic system model (cont.)
- Assume that during the time interval [t, t + Δt),
the birth-death-illness-cure processes follow
multinomial distributions with parameters
{b1Δt, d1Δt, a1Δt} and {b2Δt, d2Δt, a2Δt}.
- Assume that immigration follows Poisson
distributions with means λ1(t)Δt and λ2(t)Δt.
14The stochastic system model (cont.)
- Then, given N1(t) and N2(t), the conditional
probability distributions of {B1(t), D1(t), F1(t)}
and {B2(t), D2(t), F2(t)} are given by
- {B1(t), D1(t), F1(t)} | N1(t)
~ Multinomial{N1(t); b1Δt, d1Δt, a1Δt},
- {B2(t), D2(t), F2(t)} | N2(t)
~ Multinomial{N2(t); b2Δt, d2Δt, a2Δt}.
15The stochastic system model (cont.)
- The conditional probability distributions of
R1(t) and R2(t) are given respectively by
- R1(t) | N1(t) ~ Poisson with mean N1(t) λ1(t) Δt,
- R2(t) | N2(t) ~ Poisson with mean N2(t) λ2(t) Δt.
16The stochastic system model (cont.)
- Then, we have the following stochastic
differential equations for N1(t) and N2(t):
- N1(t + Δt) = N1(t) + R1(t) + F2(t) + B1(t) - F1(t) - D1(t)
- N2(t + Δt) = N2(t) + R2(t) + F1(t) + B2(t) - F2(t) - D2(t)
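As an illustration, the stochastic system model above can be simulated one interval at a time. The sketch below is not from the slides: the function names, the `lam1`/`lam2` keys for the immigration rates, and the choice Δt = 1 are assumptions, and the Poisson means follow the conditional form N(t) × rate × Δt. The rate values are those used in the Computer Simulation slides, with births ignored.

```python
import math
import random

# Rates from the Computer Simulation slides; births are set to 0 (assumption).
RATES = {"b1": 0.0, "d1": 0.04, "a1": 0.2, "lam1": 0.05,
         "b2": 0.0, "d2": 0.1, "a2": 0.4, "lam2": 0.03}

def poisson(rng, mean):
    """Draw a Poisson variate by Knuth's multiplication method (moderate means only)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def multinomial(rng, n, probs):
    """Assign each of n people to one event class; the remainder have no event."""
    counts = [0] * len(probs)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for j, p in enumerate(probs):
            acc += p
            if u < acc:
                counts[j] += 1
                break
    return counts

def step(rng, n1, n2, r, dt=1.0):
    """Advance (N1, N2) over one interval [t, t + dt)."""
    b1, d1, f1 = multinomial(rng, n1, [r["b1"] * dt, r["d1"] * dt, r["a1"] * dt])
    b2, d2, f2 = multinomial(rng, n2, [r["b2"] * dt, r["d2"] * dt, r["a2"] * dt])
    r1 = poisson(rng, n1 * r["lam1"] * dt)   # immigrants joining the normal group
    r2 = poisson(rng, n2 * r["lam2"] * dt)   # immigrants joining the sick group
    return (n1 + r1 + f2 + b1 - f1 - d1,
            n2 + r2 + f1 + b2 - f2 - d2)

def simulate(rng, n1, n2, r, steps, dt=1.0):
    """Return the path [(N1(0), N2(0)), ..., (N1(steps), N2(steps))]."""
    path = [(n1, n2)]
    for _ in range(steps):
        n1, n2 = step(rng, n1, n2, r, dt)
        path.append((n1, n2))
    return path

if __name__ == "__main__":
    for n1, n2 in simulate(random.Random(2004), 1000, 10, RATES, steps=10):
        print(n1, n2)
```

Drawing each person's event individually keeps the sketch dependency-free; for populations as large as the TB data, a library multinomial sampler would be the practical choice.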
17The observation model
- If some observed data are available from this
system, then we can derive some statistical
models to relate the data to the system.
- For the observation model, assume that observed
numbers of N1 and N2 people at times tk, k = 1,
..., n, are available.
- Let Y1(k) and Y2(k) be the observed numbers of N1
and N2 people at time tk.
18The observation model (cont.)
- The observation models are represented by the
statistical models
- Y1(k) = N1(tk) + [N1(tk)]^(1/2) e1(k),
- Y2(k) = N2(tk) + [N2(tk)]^(1/2) e2(k),
- for k = 1, ..., n,
- where the errors e1(k) and e2(k) are
independently distributed as normal with mean 0
and variances s1^2 and s2^2, independently for
k = 1, ..., n.
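A minimal sketch of this observation model, under the reading Y(k) = N(tk) + [N(tk)]^(1/2) e(k) with e(k) normal with mean 0 and standard deviation s. The function names are illustrative and not from the slides. The log-density is the quantity P(Y | X, T) that later enters the weighted bootstrap weights.

```python
import math
import random

def observe(rng, n, s):
    """Simulate one observation Y = N + sqrt(N) * e, with e ~ Normal(0, s^2)."""
    return n + math.sqrt(n) * rng.gauss(0.0, s)

def log_likelihood(y, n, s):
    """Log density of Y given N = n: Normal(mean n, variance n * s^2); needs n > 0, s > 0."""
    var = n * s * s
    return -0.5 * math.log(2.0 * math.pi * var) - (y - n) ** 2 / (2.0 * var)

if __name__ == "__main__":
    rng = random.Random(0)
    y = observe(rng, 1000, 0.5)
    print(y, log_likelihood(y, 1000, 0.5))
```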
19Estimation of Parameters (cont.)
- From the state space models, we can estimate
simultaneously the unknown parameters and the
state variables through the multi-level Gibbs
sampling method and the weighted bootstrap
method.
- For the above example, the birth rates b1, b2,
death rates d1, d2, and immigration rates
λ1, λ2 of the normal and sick people, and the
transition rates a1, a2 of N1 → N2 and N2 → N1 are
the parameters to be estimated.
20Estimation of Parameters (cont.)
- To illustrate, let X be the collection of all the
state variables {N1(t), N2(t)}, T the collection
of all unknown parameters {b1, b2, d1, d2, λ1,
λ2, a1, a2}, and Y the collection of vectors of
observed data sets {Y1(k), Y2(k), k = 1, ..., n}.
- Let P(T) be the prior distribution of the
parameters T, P(X | T) the conditional
probability density of X given the parameters T,
and P(Y | X, T) the conditional probability
density of Y given X and T.
21Estimation of Parameters (cont.)
- The joint probability density function of (X, Y,
T) is
- P(X, Y, T) = P(T) P(X | T) P(Y | X, T).
- From above, we can derive the conditional
probability density function P(X | T, Y) of X
given (T, Y) and the conditional probability
density function P(T | X, Y) of T given (X, Y),
respectively, as
- P(X | T, Y) ∝ P(X | T) P(Y | X, T),
- and P(T | X, Y) ∝ P(T) P(X | T) P(Y | X, T).
22Gibbs sampling procedure
- The multi-level Gibbs sampling method is an
extension of the Gibbs sampling method to the
multivariate cases.
- The method was first proposed by Sheppard
(1994).
- This method is a useful tool for estimating the
unknown parameters and state variables through a
sequential procedure.
23Gibbs sampling procedure (cont.)
- The multi-level Gibbs sampling procedures for
estimating the unknown parameters T and the state
variables X are given by the following loop:
- (1) Given initial values of T and observed data
Y, generate X from P(X | Y, T) through the
weighted bootstrap method due to Smith and
Gelfand (1992).
24Gibbs sampling procedure (cont.)
- (2) Generate T from the conditional distribution
P(T | X, Y), where X is the value obtained
in step (1).
- (3) Using T obtained from (2) as initial values,
go back to (1) to generate X, and repeat the
(1), (2) loop until convergence.
25Gibbs sampling procedure (cont.)
- At convergence, each pass of the loop yields a
draw of X from the conditional distribution
P(X | Y) and a draw of T from the posterior
distribution P(T | Y).
- Repeating these procedures, we then generate a
random sample of X and a random sample of T. We
may then use the sample means as the estimates of
X and T, and use the sample variances as the
variances of these estimates.
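The alternating loop can be illustrated on a toy model in which both conditional distributions are known exactly. This toy is an assumption made only for illustration and is not the TB model: X ~ Normal(T, 1), Y | X ~ Normal(X, 1), with a flat prior on T. Step (1) is a direct normal draw here rather than a weighted bootstrap draw.

```python
import random

def gibbs(y, n_iter=20000, burn_in=1000, seed=0):
    """Alternate X | T, Y and T | X, Y; return the post-burn-in draws."""
    rng = random.Random(seed)
    theta = 0.0                       # initial value of the parameter T
    xs, thetas = [], []
    for it in range(n_iter + burn_in):
        # Step (1): state given parameter and data, X | T, Y ~ Normal((T + Y)/2, 1/2).
        x = rng.gauss((theta + y) / 2.0, 0.5 ** 0.5)
        # Step (2): parameter given state, T | X, Y ~ Normal(X, 1) under the flat prior.
        theta = rng.gauss(x, 1.0)
        if it >= burn_in:             # keep draws only after the burn-in period
            xs.append(x)
            thetas.append(theta)
    return xs, thetas

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

if __name__ == "__main__":
    xs, thetas = gibbs(y=3.0)
    # Sample means as estimates, sample variances as their variances.
    print(mean(xs), variance(xs), mean(thetas), variance(thetas))
```

In this toy model the exact posterior of T given Y is Normal(Y, 2), so the sample mean and variance of the retained draws can be checked against known values.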
26Weighted bootstrap procedure
- In practice it is often very difficult to
derive P(X | Y, T), whereas it is easy to
generate X from P(X | T).
- We therefore developed an indirect method, using
the weighted bootstrap method due to Smith and
Gelfand (1992), to generate X from P(X | Y, T)
through the generation of X from P(X | T).
- The proof of this algorithm has been given in Tan
(2002).
27Weighted bootstrap procedure (cont.)
- The algorithm of the weighted bootstrap method is
given by the following steps:
- (a) Given T and X(i), generate a large
random sample of size m on X(i+1) by using
P{X(i+1) | X(i)} from the stochastic system model;
denote it by {X^(1)(i+1), ..., X^(m)(i+1)}.
- (b) Compute the weights wk = P{Y(i+1) | X^(k)(i+1), T}
by using the statistical model, and compute
qk = wk / (w1 + w2 + ... + wm) for k = 1, ..., m.
28Weighted bootstrap procedure (cont.)
- (c) Construct a population with elements
{E1, ..., Em} and with P(Ek) = qk.
Draw an element randomly from this population. If
the outcome is Ek, then X^(k)(i+1) is the element
of X(i+1) generated from the conditional
distribution of X given the observed data Y and
the parameter T.
- (d) Start with i = 1 and repeat (a)-(c) until
i = tM to generate a random sample of X from
P(X | Y, T).
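Steps (a) to (c) amount to one weighted resampling pass, sketched below. The function name and the `propose` and `likelihood` callables are hypothetical stand-ins for the stochastic system model and the observation model. The usage example is a toy Gaussian case, not the TB model.

```python
import math
import random

def weighted_bootstrap_step(rng, x_prev, y_next, propose, likelihood, m=1000):
    """One pass of steps (a)-(c): an approximate draw of X(i+1) given Y(i+1), X(i) and T."""
    # (a) Generate m candidates for X(i+1) from the stochastic system model.
    candidates = [propose(rng, x_prev) for _ in range(m)]
    # (b) Weight each candidate by the likelihood of the next observation.
    weights = [likelihood(y_next, x) for x in candidates]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all candidates have zero likelihood; increase m")
    q = [w / total for w in weights]
    # (c) Draw one candidate with probabilities q1, ..., qm.
    return rng.choices(candidates, weights=q, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy model: X(i+1) ~ Normal(X(i), 1) and Y(i+1) ~ Normal(X(i+1), 0.5^2).
    propose = lambda r, x: r.gauss(x, 1.0)
    likelihood = lambda y, x: math.exp(-((y - x) ** 2) / (2.0 * 0.25))
    print(weighted_bootstrap_step(rng, 0.0, 1.2, propose, likelihood))
```

Step (d) would call this function once per time point, feeding each draw back in as `x_prev`. For long series, working with log-likelihoods and subtracting the maximum before exponentiating helps avoid underflow in the weights.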
29Estimating Parameters
- Now we explain how to generate the parameters
T = {T1, T2, ..., Tk} from P(T | X, Y) by using
the multi-level Gibbs sampling method.
- With no loss of generality, we illustrate the
method with k = 3 and write T = {T1, T2, T3}.
- The procedure for estimating the parameters goes
through the following loop:
30Estimating Parameters (cont.)
- (1) Given initial values of T and X, generate
T1 from P(T1 | Y, X, T2, T3), and
denote it by T1(1).
- (2) Generate T2 from P(T2 | Y, X, T1(1), T3) and
denote it by T2(1), where T1(1) is the value
obtained in step (1).
31Estimating Parameters (cont.)
- (3) Generate T3 from P(T3 | Y, X, T1(1), T2(1)) and
denote it by T3(1), where T1(1) and T2(1) are the
values obtained in (1) and (2).
- (4) Using {T1(1), T2(1), T3(1)} obtained from
(1)-(3) as initial values, generate X from
P(X | Y, T) and denote it by X(1).
32Estimating Parameters (cont.)
- (5) Using T = {T1(1), T2(1), T3(1)} and X = X(1)
obtained from (1)-(4) as initial values, go
back to repeat (1)-(4) to generate T1(2),
T2(2), T3(2) and X(2), and repeat the loop
until convergence.
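As one hedged example of a single conditional draw in this loop, suppose a parameter block is T1 = (b1, d1, a1) and the generated X includes the transition counts. Under a uniform prior on the per-interval event probabilities (an assumption; the slides say only that a non-informative uniform prior was used), the multinomial likelihood gives a Dirichlet conditional, which can be sampled with gamma variates. The function and argument names are illustrative.

```python
import random

def draw_rates(rng, births, deaths, transitions, at_risk, dt=1.0):
    """Draw (b, d, a) from the Dirichlet conditional given summed event counts.

    at_risk is the sum of N(t) over all intervals, so the no-event count is
    at_risk - births - deaths - transitions.
    """
    no_event = at_risk - births - deaths - transitions
    # Uniform prior = Dirichlet(1, 1, 1, 1); the conditional adds the observed counts.
    alphas = [1 + births, 1 + deaths, 1 + transitions, 1 + no_event]
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(gammas)
    p_birth, p_death, p_move = (g / total for g in gammas[:3])
    # Convert per-interval probabilities back to rates.
    return p_birth / dt, p_death / dt, p_move / dt

if __name__ == "__main__":
    rng = random.Random(0)
    print(draw_rates(rng, births=0, deaths=4000, transitions=20000, at_risk=100000))
```

The same function could serve for the sick group with its own counts; blocks without a convenient closed form would need a different sampler.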
33Estimating Parameters (cont.)
- At convergence, we can generate a random sample
of X from the conditional distribution P(X | Y)
of X given Y, and a random sample of T from the
posterior distribution P(T | Y) of T given Y.
- Repeating these procedures, we then generate a
random sample of X and a random sample of T.
34Estimating Parameters (cont.)
- We may then use the sample means to derive the
estimates of the parameters T and the state
variables X, and use the sample variances as the
variances of these estimates.
- The convergence of these procedures is proved by
using the basic theory of homogeneous Markov
chains; see Tan (2002, Chapter 3).
35Estimation of the survival probabilities
- By using the estimates of the parameters, we can
estimate the survival probabilities.
- We will illustrate the procedure by using the
same example.
- Let d1 and d2 denote the death rates of the
normal and sick people, respectively; a1 and a2
are the transition rates of N1 → N2 and N2 → N1,
respectively.
36Estimation of the survival probabilities (cont.)
- Let S1(t) and S2(t) denote the probabilities
that normal and sick people, respectively,
survive to time t when the population is at risk
for the disease.
- Then we have the following system of equations:
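One system consistent with the definitions above is sketched here as a reconstruction under stated assumptions, not as the equations shown on the slide. It assumes each person moves between the normal and sick states as a continuous-time Markov chain with rates a1 and a2, dies at rate d1 or d2, and is alive at time 0:

```latex
\begin{aligned}
\frac{d}{dt} S_1(t) &= -(a_1 + d_1)\, S_1(t) + a_1\, S_2(t), \\
\frac{d}{dt} S_2(t) &= a_2\, S_1(t) - (a_2 + d_2)\, S_2(t),
\end{aligned}
\qquad S_1(0) = S_2(0) = 1 .
```

The characteristic roots of this linear system are -(a1 + a2 + d1 + d2 ± w)/2 with w^2 = (a1 + a2 + d1 - d2)^2 + 4 a2 (d2 - d1), which matches the form of the quantity w defined on the next slide.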
37Estimation of the survival probabilities (cont.)
- The solution of the above equations is given by
- where w = [(a1 + a2 + d1 - d2)^2 + 4 a2 (d2 - d1)]^(1/2),
- δ1 = a1 + a2 + d1 + d2 + w,
- δ2 = a1 + a2 + d1 + d2 - w.
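For illustration, the survival probabilities can be computed in closed form if one assumes the two-state linear system dS1/dt = -(a1 + d1) S1 + a1 S2, dS2/dt = a2 S1 - (a2 + d2) S2 with S1(0) = S2(0) = 1. That system and the particular way the solution is written below are assumptions derived from the rate definitions, not copied from the slide. The function name is hypothetical, and the example rates are those of the Computer Simulation slides.

```python
import math

def survival(t, a1, a2, d1, d2):
    """Return (S1(t), S2(t)) for the assumed two-state system; requires w > 0."""
    w = math.sqrt((a1 + a2 + d1 - d2) ** 2 + 4.0 * a2 * (d2 - d1))
    total = a1 + a2 + d1 + d2
    r_slow = -(total - w) / 2.0      # characteristic root closer to zero
    r_fast = -(total + w) / 2.0      # more negative characteristic root
    e_slow, e_fast = math.exp(r_slow * t), math.exp(r_fast * t)
    # Coefficients fixed by S(0) = 1 and S'(0) = -d for each starting state.
    s1 = ((-r_fast - d1) * e_slow + (r_slow + d1) * e_fast) / w
    s2 = ((-r_fast - d2) * e_slow + (r_slow + d2) * e_fast) / w
    return s1, s2

if __name__ == "__main__":
    for t in (0.0, 1.0, 5.0, 10.0):
        print(t, survival(t, a1=0.2, a2=0.4, d1=0.04, d2=0.1))
```

Plugging in the posterior sample means of a1, a2, d1 and d2 gives point estimates of the curves. Evaluating the function on each retained Gibbs draw would also give their sampling variability.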
38An illustrative example
- To illustrate the above methods, consider the
disease tuberculosis (TB), which is curable by
drugs.
- Given in Table 1 are the numbers of TB cases in
the USA from 1980 to 1992 reported by the CDC,
together with the total USA population sizes over
these years (CDC Report, 1993).
- Given in Figure 1 are the numbers of TB people.
- In this data set, the number of TB cases in the
USA declines to its lowest level in 1985 and then
increases, presumably due to the effects of HIV
(CDC Report, 1993).
39Table 1. Observed numbers of total people, TB
people and normal people
40Figure 1. Observed numbers of TB people
41An illustrative example (cont.)
- To fit these data, we thus assume that the
infection rate a1 = a1(1) before January 1985 and
assume that a1 = a1(2) after January 1985.
- Assume that other parameters are not affected by
HIV and other factors. Because TB is rare
in children, we may ignore births, so that the
unknown parameters are T = {λ1, λ2, d1, d2, a1(1),
a1(2), a2}.
- Let t0 = 0 denote January 1980, so that
N1(0) = 226517805 and N2(0) = 28000.
42An illustrative example (cont.)
- Using the data given in Table 1, we apply the
Gibbs sampling method and the weighted bootstrap
method to estimate the unknown parameters and
the state variables.
- Because we do not have previous knowledge about
the parameters, we assumed a non-informative
uniform prior for the parameters.
- The estimates of the parameters are given in
Table 2.
- The estimates of the numbers of TB people are
plotted in Figure 2, together with the observed
numbers.
- The estimates of the survival probabilities
of the sick people are plotted in Figure 3.
43Table 2 Estimates of parameters and standard
errors
44Figure 2. Estimated and Observed Numbers of TB
people(--- Estimated --- Observed)
45Figure 3. Estimates of the survival probabilities
of TB people
46An illustrative example (cont.)
- From Figure 2, the estimated numbers of the sick
people are close to the observed numbers.
- From the results in Table 2, the estimate of
a1(2) is slightly greater than that of a1(1),
indicating that since 1985 HIV and/or other
causes have increased the infection rate of TB
slightly.
47Computer Simulation
- To further examine this approach, we assumed
some parameter values and generated some computer
Monte Carlo data.
- The generated numbers are given in Table 3.
- The values of the parameters for generating these
data are λ1 = 0.05, λ2 = 0.03,
d1 = 0.04, d2 = 0.1, a1 = 0.2, a2 = 0.4,
N1(0) = 1000, N2(0) = 10.
48Table 3. Generated Numbers of Normal People and
Sick People
49Computer Simulation (cont.)
- Using the data in Table 3 and assuming a
non-informative uniform prior for the parameters,
we have applied the Gibbs sampling procedure and
the weighted bootstrap procedure to estimate the
parameters and the state variables.
- Given in Table 4 are the estimates of the
parameters and their true values.
- Plotted in Figure 4 are the estimates of the
numbers of the sick people together with the
generated numbers.
- Plotted in Figure 5 are the estimates of the
survival probabilities.
50Table 4. Estimates of Parameters and Standard
Errors
51Figure 4. Estimated and Observed Numbers of Sick
people (--- Estimated --- Observed)
52Figure 5. Estimates of the Survival Probabilities
of Sick People
53Computer Simulation (cont.)
- From the results in Table 4, the estimates are
very close to their true values.
- From Figure 4, the estimates of the state
variables are also very close to the generated
numbers.
- These results indicate that the methods proposed
in this article perform well on the simulated
data.
54Summary
- In this article, we have developed a state space
model for a birth-death-illness-cure process.
- We have developed a procedure (the Gibbs
sampling method together with the weighted
bootstrap method) to estimate the unknown
parameters and the state variables, and hence the
survival probabilities.
- The numerical example and the computer simulation
indicate that the methods are quite useful and
promising.
55Discussion
- In recent years, Tan and his associates have
developed some state space models for cancers and
AIDS (Tan and Chen, 1998; Tan and Xiang, 1998;
Tan and Xiang, 1999; Tan and Ye, 2000; Wu and
Tan, 2000; Tan et al., 2001; Tan et al., 2002).
- The present article extends this modeling
approach to other human diseases such as
tuberculosis. This type of modeling approach
should also be useful for other diseases, such as
heart disease, and for risk assessment of
environmental agents. In this respect,
more research is needed.