An Investigation of the Discriminatory Ability of the Clustering Effect of the Frailty Survival Model

A strategy is proposed to examine the importance of the clustering effect of the frailty model by means of the concordance probability. To this end, the methodology proposed in earlier work is extended to general frailty models and a new definition of the concordance probability is developed. The resulting measures allow one to separate the discriminatory ability of the covariate effects and the clustering effect on a within- and between-cluster level. Using a Bayesian and a likelihood approach, point estimates and 95% credible/confidence intervals are computed for the measures. Estimation properties and sensitivity to misspecification are checked in an extensive simulation study. The likelihood estimation procedure showed difficulties in estimating the predictive ability of the clustering effect without bias, while the Bayesian estimation procedure resulted in good estimation properties for all three measures. Other features are developed to complement the earlier methodology, such as an internal validation strategy as well as a procedure to calculate internally validated interval estimates in the presence of clustered data. The ensemble of these developments is illustrated on a dental data set.


Introduction
The frailty model is a popular model to analyze clustered survival data. For this model, the importance of clustering is typically assessed by means of the variance of the frailty distribution only: when the variance is large, clustering is important, and vice versa. In this article we investigate the importance of clustering in more detail. To this end, we propose to assess how clustering affects the quality of the model predictions. Despite the importance of the frailty model, few predictive measures have been suggested for it. To our knowledge, only the concordance probability has been extended to PH frailty models, resulting in 3 separate versions of the concordance probability [1]. These 3 concordance measures allow one to evaluate the discriminatory ability on a within-cluster and a between-cluster level separately and to measure the predictive gain of the clustering effect on top of the covariate effects. The methodology in [1] has, however, two major shortcomings, which are tackled in this article.
The first shortcoming in [1] is that the concordance probability is estimated using Harrell's C estimation technique. The disadvantage of this estimation technique is that, in the presence of censoring, it suffers from a bias that depends on the censoring distribution. Moreover, this estimation technique is only designed for a time-unrestricted definition of the concordance probability that assumes survival models with noncrossing survival curves. In this article, we make use of an Inverse Probability Censoring Weighted (IPCW) estimator of [2] designed for the time-restricted definition of the concordance probability of [3]. This IPCW estimator is proven to be consistent, given that the censoring model is correctly specified. The time-restricted definition of the concordance probability has the advantage that it can be applied to any survival model. Moreover, we also introduce a new time-restricted definition of the concordance probability, bearing a slightly different interpretation than the one introduced by [3]. Note that this shortcoming has also been addressed by [4], but their IPCW estimator assumes the PH frailty model as the survival model and a censoring distribution that does not depend on covariates. The second shortcoming in [1] is that the methodology is insufficiently tailored to its use in practice. We therefore provide an internal validation procedure to correct the estimated concordance probabilities for overoptimism when the same data set is used for both model fitting and model evaluation. For this, we adapt the internal validation procedure proposed for the Brier score for univariate survival data [5]. We also show how one can calculate an interval estimate for the overoptimistic and the internally validated concordance probability of interest.

Biostatistics and Biometrics Open Access Journal
In Section 2.1, the time-unrestricted definition, the time-restricted definition of [3] and the new time-restricted definition of the concordance probability are introduced for univariate survival data. In Section 2.2, it is shown how these definitions can be adapted for frailty models in general, including interval estimates. The estimation of internally validated point estimates is dealt with in Section 2.3. The estimation properties of the two time-restricted definitions of the concordance probability using a Bayesian and a likelihood approach are investigated in Section 3.1 by means of a large-scale simulation study. The effect of misspecifying the frailty distribution and the censoring model on the estimation of the concordance measures is shown in Section 3.2. All developments are applied to a dental data set in Section 3.3, followed by a discussion in Section 4.

Different concordance probabilities
In the next section, one time-unrestricted and two time-restricted definitions of the concordance probability are shown for univariate survival data, after which for each respective definition its common estimation technique is discussed.

Definitions

[6] proposed the unrestricted concordance probability, which is the earliest definition of the concordance probability. It equals the probability that a randomly selected subject i, which fails later than another randomly selected subject j, has a higher predicted survival time than subject j, or

C = P( T̂_i > T̂_j | T_i > T_j ).   (1)

Since the subjects of a survival study are followed up over a limited time period, the estimation of the concordance probability should be regarded as particular to the study. The above definition, however, gives the impression that it applies to the whole time range, since no time restriction is imposed on the pairs of subjects that are evaluated. As a result, the use of (1) in a study with a limited follow-up implies that the same survival process holds beyond the last follow-up time, which is an unverifiable assumption. Moreover, since (1) evaluates the ranking of the time-independent predicted failure times, the predicted survival probability S(τ | X) can only be used as a predictor when its ranking is not time dependent, which does not hold for all failure time models. A restricted version C^τ of the concordance probability was therefore introduced by [3], i.e.

C^τ = P( S(τ | X_i) < S(τ | X_j) | T_i < T_j, T_i ≤ τ ),   (2)

and our new definition additionally imposes d as the maximum difference in true failure time between the two randomly selected subjects:

C^{τ,d} = P( S(τ | X_i) < S(τ | X_j) | T_i < T_j ≤ T_i + d, T_i ≤ τ ).   (3)

As such, (3) measures how well subjects with a maximum difference d in failure time can be distinguished from each other, given that the earliest failing member of the pair fails before time point τ. Definitions (2) and (3) are classically estimated by the ratio of concordant to comparable pairs; for (2) this amounts to

Ĉ^τ = Σ_{i≠j} Δ_i I(Y_i < Y_j, Y_i ≤ τ) I( Ŝ(τ | x_i) < Ŝ(τ | x_j) ) / Σ_{i≠j} Δ_i I(Y_i < Y_j, Y_i ≤ τ),   (4)

with x_i the observed covariates of the failure time model of subject i, Y_i its observed follow-up time and Δ_i its censoring indicator. Pairs that satisfy the condition in the denominator (numerator) of the above formula are called comparable (concordant) pairs. As proven by [2], (4) suffers from an overoptimistic bias that depends on the censoring distribution.
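The pair-counting logic behind the unrestricted concordance probability can be sketched as follows. This is a minimal Python illustration, ignoring censoring for simplicity; all function and variable names are our own:

```python
import itertools

def concordance_unrestricted(failure_times, predicted_times):
    """Unrestricted concordance probability, estimated without censoring:
    the fraction of pairs with distinct failure times in which the subject
    failing later also has the higher predicted failure time."""
    concordant, comparable = 0, 0
    pairs = itertools.combinations(zip(failure_times, predicted_times), 2)
    for (t_i, p_i), (t_j, p_j) in pairs:
        if t_i == t_j:
            continue  # tied failure times are not comparable
        comparable += 1
        if (t_i > t_j) == (p_i > p_j):  # rankings agree
            concordant += 1
    return concordant / comparable
```

For a perfectly ranked model this returns 1, and for a perfectly reversed ranking it returns 0; with censoring, the estimation becomes the issue discussed next.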
In the remainder of this article, we will no longer consider this estimation technique of the concordance probability. [2] proposed the Inverse Probability Censoring Weighted (IPCW) estimation method for the C^τ definition, which corresponds to:

Ĉ^τ = [ Σ_{i≠j} W_{ij} I( Ŝ(τ | x_i) < Ŝ(τ | x_j) ) ] / [ Σ_{i≠j} W_{ij} ],  with  W_{ij} = Δ_i I(Y_i < Y_j, Y_i ≤ τ) / { Ĝ(Y_i | z_i) Ĝ(Y_i | z_j) },   (5)

where Ĝ(· | z) is the estimated survival function of the censoring time given the covariates z of the censoring model. Further, (5) estimates C^τ consistently under conditional independence of the censoring mechanism and the failure time model and if the censoring model is correctly specified. Hence, because the censoring weights only depend on the censoring model, estimator (5) is robust against misspecification of the failure time model [2]. The censoring weights therefore compensate for the use of just the comparable pairs in the estimation of C^τ, counterbalancing the overoptimistic bias that would otherwise have been obtained. Note that the covariates x and z of the failure time model and the censoring model, respectively, are not necessarily the same.
Following [2], we have developed a consistent IPCW estimator of (3); it differs from (5) only through an additional weight for the pair (i, j) that accounts for the restriction to pairs with a maximum difference d in failure time.
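The IPCW idea can be sketched in Python as follows, assuming, as a simplification, a covariate-free Kaplan-Meier censoring model (so the weights reduce to 1/Ĝ(Y_i)²); all names are our own:

```python
import numpy as np

def censoring_survival(times, events):
    """Kaplan-Meier estimate of the censoring survival curve G(t),
    obtained by treating censoring (event == 0) as the event of interest.
    Returns a step function G(u)."""
    order = np.argsort(times)
    t, e = np.asarray(times)[order], np.asarray(events)[order]
    steps, surv, at_risk = [], 1.0, len(t)
    for ti, ei in zip(t, e):
        if ei == 0:  # a censoring "event" lowers the curve
            surv *= (at_risk - 1) / at_risk
        at_risk -= 1
        steps.append((ti, surv))

    def G(u):
        s = 1.0
        for ti, si in steps:
            if ti <= u:
                s = si
            else:
                break
        return s
    return G

def ipcw_concordance(times, events, risk_scores, tau):
    """IPCW estimate of the time-restricted concordance probability:
    each comparable pair (subject i failing first, before tau) is
    weighted by 1 / G(T_i)^2 to compensate for pairs lost to censoring."""
    G = censoring_survival(times, events)
    num = den = 0.0
    for i in range(len(times)):
        if events[i] != 1 or times[i] > tau:
            continue  # i must be an observed failure no later than tau
        w = 1.0 / G(times[i]) ** 2
        for j in range(len(times)):
            if times[j] > times[i]:
                den += w
                if risk_scores[i] > risk_scores[j]:
                    num += w  # earlier failure carries the higher risk score
    return num / den
```

Without censoring all weights equal 1, and the estimator reduces to the classical ratio of concordant to comparable pairs of (4).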

Application to the frailty model
The concordance probability has already been adapted to the framework of Proportional Hazards (PH) frailty models of the power variance family for the unrestricted definition (1) of the concordance probability [1]. In this section, we will extend the concordance probability to the whole class of frailty models. In the next two sections, we restrict ourselves to definition (2). For the developments related to definition (3), we refer to Section A of the online supplementary material.

Definition
The frailty model is a popular model for clustered survival data. For each subject i of cluster q, a frailty term w_q is introduced to account for clustering. These frailty terms are assumed to be sampled from a frailty distribution f_w(w), which mostly is a parametric distribution such as the gamma distribution [7]. For each frailty model, two different types of survival probabilities can be considered: the conditional survival probability, which includes the frailty term, and the marginal survival probability, which integrates it out. Two types of pairs can be identified for clustered data: an inter-cluster pair, whose members belong to two different clusters, and an intra-cluster pair, whose members belong to the same cluster. Therefore, depending on which type of pair and survival probability is used, 6 different versions of the concordance probability can be introduced for frailty models (Table 1). This results in a between marginal and a between conditional concordance probability, C^τ_BM and C^τ_BC respectively, in a within marginal and a within conditional concordance probability, C^τ_WM and C^τ_WC respectively, and in an overall marginal and an overall conditional concordance probability, C^τ_OM and C^τ_OC respectively. Table 1: Definition of the 6 different types of the concordance probability. 'X' marks which elements are used by the considered measure, while '-' marks which elements are not.

Concordance probability | Inter-cluster pair | Intra-cluster pair | Marginal survival | Conditional survival
Overall Marginal (C^τ_OM) | X | X | X | -
Overall Conditional (C^τ_OC) | X | X | - | X
Between Marginal (C^τ_BM) | X | - | X | -
Between Conditional (C^τ_BC) | X | - | - | X
Within Marginal (C^τ_WM) | - | X | X | -
Within Conditional (C^τ_WC) | - | X | - | X
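The distinction between intra- and inter-cluster pairs underlying Table 1 is easily made explicit in code. A small Python helper (names are our own):

```python
import itertools

def split_pairs(cluster_ids):
    """Partition all subject pairs into intra-cluster pairs (both members
    share a cluster) and inter-cluster pairs (members of two clusters)."""
    intra, inter = [], []
    for i, j in itertools.combinations(range(len(cluster_ids)), 2):
        (intra if cluster_ids[i] == cluster_ids[j] else inter).append((i, j))
    return intra, inter
```

The within measures restrict the comparable pairs to the intra-cluster pairs, the between measures to the inter-cluster pairs, and the overall measures use both.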

Interpretation and interrelations
For the two overall concordance probabilities, it can be shown that they are weighted averages of the corresponding between and within versions, with weights π_B and π_W the probability that a pair is an inter- (intra-)cluster pair [1]. As a result, C^τ_OC and C^τ_OM do not add new information to the other 4 versions of the concordance probability. Moreover, their interpretation depends strongly on the sampling scheme of the data structure, such that their values are hard to compare across studies with different clustering designs [1]. Further, since the marginal and the conditional survival probability lead to the same ranking within a cluster, the within marginal and the within conditional concordance probability coincide and are jointly denoted by C^τ_W. Summing up, only 3 unique versions of the concordance probability can be defined for the frailty model, and each of these definitions focuses on a specific aspect of the frailty model.

Computation of point estimates
Consider a sample of size n of clustered survival data. The ranking of the marginal survival curves of an inter-cluster pair can be obtained by treating the considered pair (incorrectly) as an intra-cluster pair. As a result, the evaluation of C^τ_BM can also be completed by means of the conditional survival curves only. Since C^τ_W can be obtained by determining the ranking of the conditional survival curves only, the same ranking rule for the predicted failure times can be established for C^τ_BM and C^τ_W for a class of frailty models, applying this ranking rule to inter-cluster pairs for C^τ_BM and to intra-cluster pairs for C^τ_W. As such, only the ranking of the linear predictors is needed in the estimation of C^τ_BM and C^τ_W of the PH frailty model, since for a shared frailty term w

S(t | x_i, w) < S(t | x_j, w)  ⇔  x_i'β > x_j'β.

In summary, a ranking rule for the predicted failure times needs to be established for C^τ_BC assuming that both members of the pair belong to a different cluster, and for C^τ_BM and C^τ_W assuming that both members of the pair belong to the same cluster.
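For the PH frailty model, the ranking argument above means that the concordance computations only need risk scores. A minimal Python sketch (assuming a single shared frailty per cluster; names are our own):

```python
import math

def ph_risk_score(beta, x, frailty=1.0):
    """Risk score of a PH frailty model, frailty * exp(x' beta): larger
    scores correspond to stochastically earlier predicted failure times.
    For C^tau_BM and C^tau_W the shared frailty cancels within a pair,
    so only the linear predictor x' beta determines the ranking; for
    C^tau_BC the estimated cluster frailties must be plugged in."""
    return frailty * math.exp(sum(b * xi for b, xi in zip(beta, x)))
```

Ranking subjects by this score (higher score, earlier predicted failure) is all that the pair-based estimators of Section 2.1 require.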

Computation of interval estimates
A credible/confidence interval is constructed by means of a Bayesian/likelihood procedure, combined with the percentile non-parametric bootstrap of [8]. This bootstrapping technique has been adapted to the clustered data setting, i.e. resampling by cluster and always selecting the same number of clusters in each bootstrap sample [9]. When the failure time model is fitted using a likelihood or a Bayesian approach, the censoring model needs to be fitted using the same approach. Even if credible intervals can be readily obtained for the Bayesian procedure from the posterior distribution of the respective concordance probabilities, the bootstrap technique is necessary to ensure that the resulting 95% credible interval captures the uncertainty of the estimation of the model parameters for future values of the observed covariates [1]. The description of both procedures using the BC_a method [8], which is a more refined version of the percentile non-parametric bootstrap method, can be found in Section B of the online supplementary material.
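The cluster-level resampling step can be sketched as follows; this is a simplified Python illustration of one bootstrap replicate, with names of our own:

```python
import random

def cluster_bootstrap(clusters, rng=None):
    """One non-parametric bootstrap replicate for clustered data:
    whole clusters are resampled with replacement, and every replicate
    contains the same number of clusters as the original sample."""
    rng = rng or random.Random()
    ids = list(clusters)
    draws = [rng.choice(ids) for _ in ids]
    # relabel the draws so a cluster selected twice enters as two clusters
    return {new_id: clusters[old_id] for new_id, old_id in enumerate(draws)}
```

The measure of interest is recomputed on each replicate, and the percentile interval is then read off the empirical distribution over replicates.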

Internal validation
In practice, the same data is often used for developing the failure time model and measuring the predictive ability, hereby generating possibly overoptimistic estimates of the predictive ability. One can correct for this by applying an adaptation to the clustered setting of the .632 and .632+ bootstrap cross-validation procedures proposed for the Brier score for univariate failure time models [5]. Additional developments are presented here for C^τ_BC, C^τ_BM and C^τ_W; since the developments for the three measures are the same, we denote the measure of interest as C^τ. Let Ĉ^τ be the (overoptimistic) point estimate of the measure; the procedure then amounts to:

a. Split the sample randomly in a training and a validation set. Make sure that the training (validation) set contains approximately 63.2% (36.8%) of the subjects of the original sample.

b. Repeat the same modeling steps in the training sample as in the original data set. Obtain estimates for the model parameters.

c. Calculate a point estimate C^τ_boot of the measure on the subjects of the validation set only.

d. Repeat the first three steps of this procedure B times, with B reasonably large.
Note that step a) of the above procedure is different from step a) of the classical procedure [5,11]. The .632 estimate combines Ĉ^τ and the averaged bootstrap estimate C̄^τ_boot, while the .632+ estimate [5] additionally accounts for the degree of overfitting. The relative overfitting rate R^τ is

R^τ = (Ĉ^τ − C̄^τ_boot) / (Ĉ^τ − C^τ_NoInf),

where C^τ_NoInf corresponds to the no-information error rate of the restricted concordance probability, for which the survival status is independent from the covariates x. As such, for each subject i of cluster j, its contribution to C^τ_NoInf is computed by averaging over all the observed covariate values of the data set. When R^τ is lower than 0 (higher than 1), R^τ is fixed to 0 (1). The .632+ bootstrap cross-validation estimate will differ more from the .632 bootstrap cross-validation estimate as R^τ gets closer to 1. Further, [11] have found that the .632+ bootstrap cross-validation estimate generally performs better than the .632 bootstrap cross-validation estimate. An estimate of the frailty term of each of the clusters of the data set is needed to estimate C^τ_BC. However, it is possible that some clusters of the original data set are not present in the training set of step a) of the above procedure, such that in the validation set of step c) no point estimates can be calculated for these subjects. Therefore, we propose to calculate C^τ_BC in step c) only for those subjects whose cluster was present in the training set of step a). Note that it is assumed that C^τ_BC cannot be calculated for only a minority of the clusters of the validation data set, such that the 63.2/36.8 proportion (training/validation) is not too strongly disrupted.
Too strong a disruption of this proportion invalidates the above internal validation scheme. A credible/confidence interval for the internally validated point estimate can be constructed by combining the above procedure and the one of Section 2.2.4. More specifically, we start by sampling K bootstrap data sets. For each of these bootstrap data sets, the above procedure is
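The .632+ combination described above can be sketched as follows. This Python illustration flips the sign conventions relative to the error-rate formulation of [5,11], since a larger concordance is better; all names are our own:

```python
def c632plus(c_apparent, c_boot, c_noinf):
    """.632+ cross-validated estimate of a concordance measure.
    c_apparent: estimate on the full data (overoptimistic),
    c_boot: average estimate over the bootstrap validation sets,
    c_noinf: no-information value (covariates independent of survival)."""
    denom = c_apparent - c_noinf
    r = (c_apparent - c_boot) / denom if denom > 0 else 0.0
    r = min(max(r, 0.0), 1.0)  # clip the relative overfitting rate to [0, 1]
    w = 0.632 / (1.0 - 0.368 * r)  # weight on the validated estimate
    return (1.0 - w) * c_apparent + w * c_boot
```

With no overfitting (c_boot equal to c_apparent) the weight is 0.632 and the estimate equals the apparent one; with full overfitting (c_boot equal to c_noinf) the weight becomes 1 and the no-information value is returned.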

Simulation study
After describing the design of the simulation study, the properties of the Bayesian/likelihood point estimates and the percentile credible/confidence intervals of C^τ_BC, C^τ_BM and C^τ_W are investigated.

Design of the simulation study
In this section we provide details on the simulation study. A gamma frailty proportional hazards (PH) model with a Weibull baseline hazard (shape …) is used to generate the data. The fitted failure time model of the semi-parametric model corresponds to a classical Cox PH model with a partial likelihood estimation procedure [12] for the likelihood procedure, and to a gamma independent increments model for the baseline hazard with a PH specification for the covariates [13] for the Bayesian procedure. Since no covariates influence the censoring process, the Kaplan-Meier model of [14] is chosen as the censoring model for the likelihood procedure and the gamma independent increments model of [13] for the Bayesian procedure. The empirical bias is calculated as the averaged difference between the true and the estimated concordance probabilities, and the MSE as the mean of the squared differences between the true and estimated concordance probabilities. A positive (negative) empirical bias is defined as any estimate greater (smaller) than the true value of the measure.

Table 3: Simulation study: width of the credible/confidence intervals of C^τ_BC, C^τ_BM and C^τ_W using the percentile method, estimated by the Bayesian and likelihood method, investigated for varying degrees of censoring ('0%', '25%', '50%' and '80%') and cluster size ('2', '5', '10' or '20').

The width of the intervals decreases (increases) as the cluster size increases (decreases) and/or the censoring percentage decreases (increases) (Table 3). The semi-parametric likelihood approach shows a similar behavior as the parametric likelihood approach.
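The data-generating step of such a simulation can be sketched as follows. This is a Python illustration; the parameter names and values are our own, not those of the study design:

```python
import math
import random

def simulate_gamma_frailty_weibull(n_clusters, cluster_size, beta,
                                   theta, shape, scale, seed=0):
    """Draw clustered failure times from a gamma-frailty Weibull PH model:
    conditional hazard h(t | x, w) = w * exp(beta * x) * h0(t), with h0 a
    Weibull hazard and frailties w ~ Gamma(1/theta, theta), so that
    E[w] = 1 and Var[w] = theta."""
    rng = random.Random(seed)
    data = []
    for q in range(n_clusters):
        w = rng.gammavariate(1.0 / theta, theta)  # shared cluster frailty
        for _ in range(cluster_size):
            x = rng.gauss(0.0, 1.0)  # one standard-normal covariate
            # Invert S(t | x, w) = exp(-(t / scale)**shape * w * e^{beta x})
            u = rng.random()
            t = scale * (-math.log(u) / (w * math.exp(beta * x))) ** (1.0 / shape)
            data.append((q, x, t))
    return data
```

Censoring would then be superimposed by drawing independent censoring times and keeping, per subject, the minimum of the failure and censoring time.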

Additional simulation results for the parametric model
The estimation of C^τ_BC by the Bayesian method shows no significant empirical bias, except when the intra-cluster correlation is very small, the cluster size very small and the censoring percentage very high. The latter empirical bias is provoked by an overestimation of the frailty variance parameter. For the likelihood method, C^τ_BC shows a positive empirical bias which increases as τ and the intra-cluster correlation increase, but which is unaffected by the sample size. The estimation properties of C^τ_BM and C^τ_W by the Bayesian and likelihood method are not influenced by τ, the intra-cluster correlation or the sample size. The quality of the credible/confidence intervals of the percentile and the BC_a bootstrap method are found to be very similar. Note that the estimate of the acceleration â of the BC_a non-parametric bootstrap method was found to be very small in every investigated scenario, such that we recommend fixing â to zero.

Effect of misspecification
In this section, we show the effect of misspecifying the frailty distribution and the censoring model on the estimation properties of the concordance measures. The estimation of C^τ_BC is in general strongly affected by a misspecified frailty distribution, resulting in a positive (negative) bias when the empirical distribution of the estimated frailty terms ŵ is more (less) variable than the distribution of the true frailty terms w. Similar results were found by [15].

Misspecification of the censoring model
A functionally misspecified covariate effect or a misspecified censoring model will result in a negative bias, especially for high censoring percentages (50% and 80% in our simulation study).
Covariates that have no impact on the censoring process will not influence the quality of the estimation when included in the censoring model. Omitting important variables, however, does result in a negative bias. In practice, several equally well performing censoring models can be proposed. This is, however, not a problem, since the censoring model is merely used as a working model. Note, however, that it is easier to detect misspecifications of the censoring model as the censoring percentage increases since, in contrast to the failure time model, a higher censoring percentage results in an increase of information for the censoring model. Similar results were found by [16].

Simulation results for the
Due to the misspecification of the frailty distribution, the estimated frailty terms differ from the true frailty terms, hereby not only disrupting the bias patterns of ( ),

Application to the amalgam data set
In this data set, the effect of different treatment modalities on the longevity of different types of amalgam restorations is investigated. The primary covariates of the study are 4 cavity wall treatments ('CSA', 'CWF', 'Copalite' and 'Silver Suspension') and the alloy of the amalgam (levels: 'NTD', 'Tytin' and 'CNG'). The secondary covariates are the type of the restoration (levels: 'MO/DO' and 'MOD'), the type of tooth (levels: 'premolar' and 'molar'), the position of the tooth in the mouth (levels: 'upper left', 'upper right', 'lower left' and 'lower right'), the operator (levels: 'operator 1', 'operator 2' and 'operator 3'), gender and age. The data set is composed of three clinical trials, and in each clinical trial the primary variables were combined into 4 treatment modalities, assigning each modality randomly to 4 (or 8 or 12) teeth in a 2 x 2 (factorial) design within patient [17]. Note that this study was mainly designed to evaluate the significance of the primary covariates. In this paper we evaluate the predictive potential of all the primary covariates. Since the early failures are suspected to be of a different nature than the late failures and since there are only a few early failures, we restrict our analysis to the late failures only, i.e. failures that occur at least 9 years after enrollment in the study. This results in 174 patients contributing 1347 amalgam restorations. The clustered data structure is unbalanced, with 4, 8 and 12 restorations seen in 36, 89 and 33 patients, respectively. The median follow-up time is 14.48 years and the true failure time is observed for only 187 amalgam restorations, leading to a censoring percentage of 86.2%.
Based on the results of the simulation studies of Section 3.1, we have chosen the Bayesian gamma frailty generalized gamma model as the failure time model, which is a very flexible parametric Bayesian model [18]. Note that the failure time model consists of the primary covariates only. Different censoring models for the IPCW weights were tested, with the Bayesian generalized gamma model including both primary and secondary covariates resulting in the highest concordance probability estimates. We evaluate the model's discriminatory ability from 9 to 15 years after enrollment in the study since, as recommended by [19], censoring is not too heavy at the end of this time interval. The .632+ bootstrap cross-validation estimators are chosen for the internal validation of the considered concordance probabilities. The R code of this analysis, applied to fictitious data, can be found in the online supplementary material.

Analysis using the full information
The apparent and validated point estimates and 95% percentile credible intervals of the C^τ measures (B = 100, K = 100, τ = 6) are shown in Table 4. Since τ is chosen to be 6, all comparable pairs of the [9,15] time interval are considered. Little difference with the values in Table 4 is found for the C^{τ,d} measures using τ = 6 and d = 6. We see that the covariates have practically no intra-cluster predictive ability but a strong inter-cluster predictive ability. Further, even after validation, the between conditional concordance probability improves reasonably strongly on the between marginal concordance probability, implying the presence of a strong clustering effect on top of the covariate effects. Thus, if one succeeds in a future study to fully capture the clustering effect in covariates and if the primary variables are also included in this future model, the between marginal concordance probability can potentially attain a value of 0.74. Note that the drop in predictive ability due to the internal validation is more pronounced for the between conditional concordance probability than for the between marginal and within concordance probability. All the apparent and validated 95% percentile credible intervals show a strong overlap, and the upper bounds of the validated 95% percentile credible intervals are all lower than the upper bounds of the apparent 95% percentile credible intervals. Surprisingly, the lower bounds of the validated 95% percentile credible intervals of the within and the between marginal measures are higher than the lower bounds of the apparent 95% percentile credible intervals.

Analyzing τ and d − varying patterns
In Figure 3, it can be seen that earlier failing teeth are easier (more difficult) to discriminate from later failing teeth than later failing teeth are. As a result, the difference between the validated curves of C^τ_BC and C^τ_BM mildly decreases as τ increases, such that the clustering effect induces a stronger gain in predictive ability for the earlier failing teeth as compared to the later failing teeth. Note that C^τ_W even attains values lower than 0.5 for τ values lower than 14, meaning that the ranking of the predictions is systematically in the wrong direction for teeth belonging to this τ range. In the right panel of Figure 3, a monotone increasing pattern is depicted for both

Discussion
In this article, the methodology proposed in [1] is extended to general frailty models. Also, a new definition of the concordance probability is developed, as well as an internal validation procedure and a procedure to calculate interval estimates in the presence of clustered data. Note that all these developments can also be applied to univariate survival models; we merely focused on the frailty model in this article. Other very useful measures can be extended to the frailty model in a similar manner [19][20][21][22]. We chose the concordance probability since it is the most widely used measure for the discriminatory ability of survival models. From the simulation study, we learned that it is of major importance to specify the censoring model correctly. Despite this possible shortcoming, the IPCW estimation method is up to now, to our knowledge, the best method to estimate the concordance probability in the presence of censoring. An alternative estimation method could be based on a non-parametric imputation method based on the covariate distribution, in which the censored failure times are replaced by true failure times whose covariate information is close to the one of the censored observation. This estimation method would be, in spirit, similar to what has been done by [23] for the estimation of survival models.
The between conditional concordance probability estimated by the likelihood approach suffers from an overoptimistic bias, especially for a small cluster size and a high censoring percentage. By means of a small simulation study, we could determine that this overoptimistic bias is caused by the empirical Bayes estimates of the frailty terms, since the overoptimistic bias of the between conditional concordance probability disappeared once the frailty terms were replaced by their population values. Note that similar results were found in [1] and [24]. An external validation of the between conditional concordance probability can only be performed when the same clusters are used for the external data set as for the training data set. Indeed, only under these conditions are estimates of the frailty terms for the clusters of the external data set available, which are indispensable to compute the between conditional concordance probability of the external data set. Note that this only emphasizes the importance of an internal validation procedure for the between conditional concordance probability. A different solution to this problem could be the estimation of the frailty term of a new cluster, but this approach was not considered here. In this article we extended the concordance probability for frailty models with a 2-level clustering structure, since this is the most common clustering structure in survival analysis [7]. In practice, however, more complicated clustering structures may be encountered. In [24], for example, it is shown how the concordance probability and the Brier score can be extended for nested 2-level and 3-level multilevel binary regression models.