A Bayesian Analysis of Sub-Samples in Survival Analysis
Zeleke Worku*
Tshwane University of Technology (TUT) Business School, South Africa
Submission: June 12, 2017 Published: October 04, 2017
*Corresponding author: Zeleke Worku, Tshwane University of Technology (TUT) Business School, South Africa, Tel: (+27-12] 382 3050/ (+27-82] 870 2758, Fax: (+27-12) 382 3052; Email: WorkuZ@tut.ac.za
How to cite this article: Zeleke W. A Bayesian Analysis of Sub-Samples in Survival Analysis. Biostat Biometrics Open Acc J. 2017; 3(2): 555609. DOI:10.19080/BBOAJ.2017.03.555609
Abstract
A Bayesian analysis of sub-samples was performed in order to estimate predictors of survival that were based on exponential and Weibull models. Analysis was done by using data followed up over a five-year period on a random sample of 4001 children under five years of age and their mothers from the Maseru District of Lesotho. In the process of constructing both models, the presence of censored observations and truncation at the age of five years were taken into account. Bayesian inference with Gibbs sampling was used in order to obtain marginal posterior densities for each predictor variable and the intercept. The Weibull model gave a better fit in comparison with the exponential model
Keywords: Sub samples; Censored observations; Bayesian analysis; Weibull distribution; Gibbs sampling
Introduction
The study was based on a 5-year-long follow-up study of 4, 001 children who were born in the Maseru District of Lesotho [1] . All children and their mothers in the study were followed up over the five-year period of study. During the 5-year follow-up period, 393 of the 4001 children had died. Data on children in the study were collected from clinical records and their mothers by health professionals from the Health Ministry of Lesotho.
The objectives of the study were
A. to investigate the relationship between the lifetime of children and 9 dichotomous predictor variables that affect the lifetime of children using specially constructed exponential and Weibull models to account for truncation and censored observations [2], and
B. To demonstrate how the drawing of subsamples considerably simplifies the daunting task of estimating parameters from a joint posterior density involving several predictor variables using Matlab programing [3] Bayesian inference [4] and Gibbs sampling [5,6].
An attempt was made to analyze the relationship between the lifetime of children and 9 predictor variables that affect their survival using conventional methods such as maximum likelihood estimators from a number of distributions and the normal and log-normal models [7-9]. However, the estimated models failed to fit the data well. This was confirmed by the use of widely used diagnostic procedures in the case of maximum likelihood estimators and hazard function plots in the case of the lognormal model. Conventional methods failed because
I. The error terms were not distributed normally with mean 0 and constant variance regardless of the large sample size,
II. The data included censored observations and the lifetime variable had to be truncated at t=5 years. As a remedy, special Weibull and exponential models were constructed to account for censored observations and truncation at the age of 5 years.
Parameters were estimated using Bayesian inference and Gibbs sampling by drawing several subsamples. The use of subsamples instead of the complete sample considerably reduced the time taken for estimation of parameters using Matlab programming and led to more or less similar point estimates of parameters at a cost of inflating the variance of estimation slightly.
In keeping with principles of Bayesian inference, the posterior marginal density was obtained as the product of the likelihood and the prior. For the exponential model, uniform priors were used to estimate the parameters β0,β1,......β9. For the Weibull model a special reference prior was used [10] to estimate regression coefficients due to the presence of censored observations. For the Weibull model, it was assumed that σ is unknown, and the non-informative prior was used to estimate it. All prior and likelihood densities used to obtain posterior densities were constructed taking the presence of censored observations and truncation at t=5 years into account. To facilitate computation, analysis was done by drawing 50 subsamples of size 400 each drawn from the complete sample. From the joint posterior densities, marginal densities were obtained. Point estimates and 95% credibility intervals were obtained for each parameter in the study. The Weibull model gave more realistic estimated probabilities than the exponential model.
The following items will be discussed in the order shown below:
I. The list of variables of study and their levels.
II. The drawing of sub-samples from the complete sample.
III. The exponential model and results obtained from the exponential model.
IV. The Weibull model and results obtained from the Weibull model.
V. Comparison of results from the exponential and Weibull models.
List of variables of study and their levels
t: The lifetime of the child in years ( )
: The literacy status of the mother
1. Illiterate
0. Literate
: The income status of the family
1. Low
0. otherwise
: The place of delivery of the child
1. outside a health facility
0. Otherwise
: Attendance of antenatal health care services by the mother
1. No
0. Yes
: Attendance of postnatal health care services by the mother
1. No
0. Yes
: Availability of a nearby health facility
1. No
0. yes
: The extent of malnutrition of the child
1. Serious
0. otherwise
: The presence of acute respiratory infections
1. Yes
0. no
: Age of mother at first birth
1. Less than 18 years
0. Greater than or equal to 18 years
The drawing of subsamples
Data analysis was done on a PC using Bayesian inference, Gibbs sampling and MATLAB programming. Simulations resulting from the use of Gibbs sampling with a sample of size 4001 cases and 10 variables took very long, and it was necessary to shorten the time needed for analyses by drawing several subsamples of size 400 (10% of the complete sample size) and derive the posterior distributions based on the subsamples using the Gibbs sampler. Besides, one major objective of the study was to demonstrate that the Gibbs sampler does simplify the estimation of parameters using Bayesian inference considerably. It is true that the posterior distribution of a parameter based on a large sample has a smaller variance or spread than the posterior distribution of a parameter based on a small sample. Realizing this however, a procedure of drawing subsamples of size 400 out of 4001 cases was carried out. This procedure was repeated several times as in the bootstrap sampling-resampling process [5]. Analytical results are therefore based on posterior densities derived from the subsamples.
In what follows next, the efficiency ofthe procedure of drawing subsamples from the complete sample will be demonstrated using the Weibull model to estimate the parameter μ Assume that T ~ WEI(μ, σ) where T denotes the lifetime of the child. The probability density function of T is given as follows:
γ =0.5772 = Euler's constant.
For the parameters µ and σ , it can be shown that
Suppose that σ = 1.48 is known, and we wish to estimate µ from a simulated random sample of size N = 4001 using fT (t) given μ=-1 . Let the prior distribution of μ be p(μ) ∞ 1 where p(μ) is a non-informative reference prior [10]. The likelihood function of μ is given as follows:
In Figure 1, the posterior distribution for a given sample t1,...., tNis shown, denoted by (1). However, if a subsample of size 400 is drawn from the sample, the posterior distribution of μ looks like the figure denoted by (2) in Figure 1. The variance of (2) is of course larger than the variance of (1). Subsamples of size 400 are drawn repeatedly from the sample 50 times, and then the average of the 50 posterior distributions μ is obtained. The resulting average posterior distribution of μ looks like the figure denoted by (3) in Figure 1. Posteriors (1) and (3) in Figure 1 give almost the same means, which means that Bayes estimates of μ are about the same as estimates obtained from the complete sample. (The true value of μ is -1.).
Even if posterior densities based on subsamples have larger spreads than posterior densities based on the complete sample, the process of drawing subsamples does not affect estimates of the expected values of parameters significantly. The standard deviation of the posterior distribution based on the complete sample is approximately 4 times smaller than the standard deviation of the posterior distribution based on the mean of 50 posterior densities corresponding to the 50 subsamples drawn for the experiment. As it turned out, it was easier to do analyses based on the 50 subsamples drawn from the compete sample. Furthermore, results from the subsamples were more or less as good as results from the complete sample.
The exponential model and results
Let T denote the lifetime of the child. Let x1,....,x9 denote the 9 dichotomous predictor variables that affect the lifetime T. If the lifetime T follows the exponential distribution, the pdf and cdf of T will be given as follows:
The effect of censored observations
At the time of data collection, 3608 of the 4001children in the study were still alive. We thus assume a random sample of size n observations on the lifetimes t1 ,t2,.......tn where most of the t's are censored observations of Type I. Let the symbol be used to indicate whether the child is dead or alive.
The likelihood function
Let the pdf and cdf of T be denoted by and respectively. The likelihood function becomes:
The effect of truncation at t = 5 years
Let p1 denote the probability that the child dies before the age of five. Then, the distribution of T will have a spike of length p2 = 1-p1 at t = 5 since the data reveals no information about children with ages older than 5 years. In this study, p1 is fixed. It is assumed that p1= 0.135 or the under-five mortality rate of lestho.Thus p2=1-0.135=0.865
After truncation at t=5 years, The pdf of T becomes
After truncation at t=5 years, the likelihood function for t≤5 becomes:
Using Bayesian inference, the posterior density is a function of the product of the prior density and the likelihood function. For the exponential model, the uniform prior and the likelihood function given in (3.7) are multiplied with each other to obtain the joint posterior density. From the joint posterior density, the marginal densities corresponding to each parameter are obtained. Point estimates and 95% credibility intervals are given for each parameter of study.
For any age t such that 0<t≤5 years, the expected lifetime of the child at birth, E(T), is approximated by the expression shown in (3.8):
The estimated probability that the child survives more than t years at birth, is denoted by STM where 0 < t <5 ,and is approximated by the following expression shown in (3.9):
The probability of surviving more than 1 year at birth is denoted by ST (t=1) , and is obtained by using t=1 in the expression for ST( in (3.9). Doing so, we obtain the following simplified expression for st (t = 1):
In(3.8)and (3.10),p1=0.135,p2=1-p1=0.865.given in (3.3). Using the Gibbs sampler, the marginal posterior densities of the parameters were obtained by simulation. From the simulated posterior densities, mean values were obtained and used as Bayes estimates. Computer programs were written in Matlab. The exponential model gave estimates of regression coefficients and 95% credibility intervals shown in Table 1. The next step is to use the expressions in (3.3), (3.8) and (3.10) to work out values of E(T) and ST(t=1) respectively for the exponential model using estimated results shown in Table 1 above. The 9 predictor variables mentioned in the expression for .£f in (3.3) are each dichotomous. The high level of each variable is given by xi=1; i=1,...9 while the low level of each is given by xi=0; i=1,...9(see list of variables and their levels). Four interesting possible combinations of the 9 predictor variables will be used for discussion of cases 1 to 4
Case 1: When xi =0; i = 1,..,9 (when all 9 explanatory
Case 2: When x1 = x3 = x7 = x9 = 0 (when 5 of the 9 variables are at their low levels), explanatory variables are at their low levels),
Case 3: When x2 = x4 = x6 = X8 = 0 (when 4 of the 9 explanatory variables are at their low levels),
Case 4: When xi = 1; i =1,......,9 (when all 9 explanatory variables are at their high levels),
In Case 1, the child is exposed to minimum risk as each of the 9 predictor variables is fixed at low level. Case 2 carries more risk than Case 1 as 5 of the 9 predictor variables are fixed at low level. Case 3 carries more risk than Case 2 as 4 of the 9 predictor variables are fixed at low level. Case 4 carries maximum risk as none of the 9 predictor variables is fixed at low level. For the 4 cases discussed above, values of can be summarized as shown below in (3.11):
Using values of shown in (4.23) and the expressions in (4.16), (4.21) and (4.22), the expected value of T can be estimated for Cases 1 to 4 using (4.20). Estimated values of E(T) are summarized as shown below in (4.24) for the 4 cases:
The result in (3.12) shows that when all 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 4.6604 years. When 5 of the 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 4.6211 years. When 4 of the 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 4.4322 years. When none of the 9 predictor variables is kept at its low level, the expected lifetime of the child at birth reduces to 4.3841 years.Using the expression in (3.8), the expected lifetime E(T) at birth of any child can be predicted for any possible values of x1,x2,......,x9 and values of t such 0<t≤5 using values of the estimated regression coefficients shown inTable 1
For the exponentialmodel, the probability of surviving more than 1 year, ST(t=1) , is computed using the expression in (3.10) and estimated results in Table 1.For the casesdiscussed earlier,the following values of ST(t=1) is estimated:
For the children in the study, the probability of surviving more than 0 years is given by the expression in (3.14):
If xi=0; i=1,.....,9 in (3.14).it follows that ST(t=0)=1=p1+p2. That is,
The fact demonstrated in (3.15) shows that the survival probability function introduced earlier in (3.10) is well defined.
The weibull model and its results
Let T denote the lifetime of the child. T~WEI(λ, β) . The pdf of T is given by
The Weibull pdf can be reparametrized in terms of μ and σ where
Reparametrizing the Weibull pdf in terms of μ and σ enables us to express μ as a linear combination of the 9 covariates x1,...,x9 introduced earlier in the study.
The effect of truncation at t = 5 years
Let p1 denote the probability that the child dies before the age of five. Then, the distribution of T will have a spike of length p2 = 1 - p1 at t=5 since the data reveals no information about children with ages older than 5 years. The quantity p1 is fixed, and is equal to 0.135 or the under-five mortality rate of Lesotho. Thus, p2 =1 - 0.135 = 0.865 .
In (4.8), σ is estimated by which is the estimated standard deviation of -logT. ( is a crude maximum likelihood estimator of -logT which disregards the presence of censored observations in the data and the fact that the pdf is truncated at t=5.). In (4.8), if t=5, v (t - 5) will be obtained. Throughout the discussion of the Weibull model, the quantity ν >(t - 5) shall be denoted by νc . That is,
νc=ν(t=5)(4.10)
The effect of censored observations
We are interested in the estimation of the parameters and σ given a sample (ti, xi, δi, i=1,....n) of n observations on the lifetime variable T where
In the Weibull model, the uniform prior is used to estimate β0, β1 ,....,β9while the non-informative prior is used to estimate σ . That is,
The likelihood function
Due to the presence of censored observations, the following likelihood is used:
are the pdf and cdf of T respectively. In Bayesian analysis, the posterior distribution is a function of the product of the likelihood function and the prior distribution. That is,
Using the posterior distribution shown in (4.14), each marginal posterior density corresponding to the explanatory variables used in the study can be obtained. The posterior density in (4.14) takes into account the presence of censored observations and truncation at t=5 years. The expected lifetime at birth is obtained as follows:
Using (4.18) and (4.19) in (4.17), the expected value of T becomes
By using the Gibbs sampler, the marginal posterior densities of the parameters were obtained by simulation. From the simulated posterior densities, mean values were obtained and used as Bayes estimates. Computer programs were written in Matlab. The Weibull model gave the estimates of regression coefficients and 95% credibility intervals shown below in Table 2.
Case 1: When xi = 0; i = 1,......,9(when all 9 explanatory variables are at their low levels),
Case 2: When x1 = x3 = x5 = x7 = x9 = 0 (when 5 of the 9 explanatory variables are at their low levels),
Case 3: When x2 = x4 = x6 = x8 = 0 (when 4 of the 9 explanatory variables are at their low levels),
Case 4: When xi = 1; i = 1,....,9 (when all 9 explanatory variables are at their high levels),
In Case 1, the child is exposed to minimum risk as each of the 9 predictor variables is fixed at low level. Case 2 carries more risk than Case 1 as 5 of the 9 predictor variables are fixed at low level. Case 3 carries more risk than Case 2 as 4 of the 9 predictor variables are fixed at low level. Case 4 carries maximum risk as none of the 9 predictor variables is fixed at low level. For the 4 cases discussed above, values of can be summarized as shown below in (4.23):
Using values of shown in (4.23) and the expressions in (4.16), (4.21) and (4.22), the expected value of T can be estimated for Cases 1 to 4 using (4.20). Estimated values of E(T) are summarized as shown below in (4.24) for the 4 cases:
The result in (4.24) shows that when all 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 4.97 years. When 5 of the 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 2.73 years. When 4 of the 9 predictor variables are kept at their low level, the expected lifetime of the child at birth is 1.98 years. When none of the 9 predictor variables is kept at its low level, the expected lifetime of the child at birth reduces to 0.89 years.
Using the expression in (4.20), the expected lifetime E(T) at birth of any child can be predicted for any possible values of x1, x2,...,x9 and values of t such that 0 < t ≤5 using values of the estimated regression coefficients shown in Table 2. It should be noted that the E(T)≤5 years as the pdf of T is truncated at t=5 years. The estimated probability of surviving more than t years from birth for 0 < t ≤5 is given by:
In (4.25),p1 = 0.135 p2 = 0.865 , α = 0.6 , 𝒞 ,is given in (4.22), Vc is given in (4.16) and is given in (4.15). In (4.25), values of t are such that 0 < t <5 .The probability of surviving more than 1 year is denoted by ST (t = 1) , and can be evaluated for each of the 4 cases discussed above using t=1 in (4.25). Case 1: x= 0; i = 1,....,9 (when all 9 explanatory variables are at their low levels) Case 2: x1= x3 = x5 = x7 = x9 = 0 (when 5 of the 9 explanatory variables are at their low levels)
Case 3: x2 = x4 = x6 = x8 = 0 (when 4 of the 9 explanatory variables are at their low levels)
Case 4: xi=1; i = 1,. . ,9 (when all 9 explanatory variables are at their high levels) In Case 1, the child is exposed to minimum risk as each of the 9 predictor variables is fixed at low level, and hence the probability of survival should be a maximum. Case 2 carries more risk than Case 1 as 5 of the 9 predictor variables are fixed at low level. As a result, the survival probability of Case 2 should be smaller than that of Case 1. Case 3 carries more risk than Case 2 as 4 of the 9 predictor variables are fixed at low level (Table 3). Thus, the survival probability of Case 3 should be smaller than that of Case 2. Case 4 carries maximum risk as none of the 9 predictor variables is fixed at low level, and hence the survival probability of Case 4 should be the smallest of the 4 probabilities. For the 4 cases discussed above, values of are obtained as follows:
Comparison of Results
From Table 4, it can be seen that estimates from the Weibull model have larger magnitudes (in absolute value) than estimates from the exponential model. Besides, all estimates (except for the intercept) from the Weibull model are positive, while 3 estimates from the exponential model are negative even if level 1 (the high level) of each of the 9 predictor variables in the study contributes to the probability of death. This fact shows that estimates from the Weibull model are larger in magnitude and more realistic than estimates from the exponential model. Table 5 shows that, for the Weibull model, as the magnitude of risk increases, the expected lifetime at birth decreases over an evenly spread and more realistic range of lifetime. Although the same trend is observed in the exponential model, the variation is limited from 4.66 years to 4.38 years only. This shows that the Weibull model gives a more realistic result. Table 6 shows that, for the Weibull model, as the magnitude of risk increases, the probability of surviving more than a year decreases. For the exponential model however, the magnitude of risk and the probability of surviving more than a year are directly related. This shows that the Weibull model gives a better fit and more realistic estimates in comparison with the exponential model in survival analyses procedures.
References
- De Waal DJ, Worku ZB, Groenewald PCN (1998) The Effect of the Duration of Breastfeeding on the Lifetime of Children in Lesotho. South African Statistical Journal 32(1): 67-82.
- Collett D (2015) Modelling Survival Data in Medical Research. CRC press, New York, USA.
- Venkataraman P (2009) Applied Optimization with MATLAB Programming. John Wiley & Sons, New York, USA.
- Gill J (2014) Bayesian methods: A Social and Behavioral Sciences Approach. CRC press, New York, USA.
- Smith AFM, Gelfand AE (1992) Bayesian Statistics Without Tears: A Sampling-Resampling Perspective. The American Statistician 46(2): 84-88.
- Gelman A, Carlin JB, Stern HS, Rubin DB (2014) Bayesian Data Analysis. Boca Raton, FL, USA.
- Asteriou D, Hall SG (2015) Applied Econometrics. Palgrave Macmillan, New York, USA.
- Espejo MR (2015) Statistical Methods for Ranking Data. John Wiley & Sons, New York, USA.
- Harrell F (2015) Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression and Survival Analysis. Springer, New York, USA.
- De Waal DJ, Groenewald PCN, Kemp CJ (1995) Perturbation of the Normal Models in Regression. South African Statistical Journal 29(1): 105-130.